This is Part 1 of the Build an End-to-End RAG Pipeline series. Across three parts, you’ll connect Gloo AI’s features into one continuous workflow: the publisher and content you set up here are the same ones you’ll manage in Part 2 and harden in Part 3. In this part you’ll create a publisher, upload content with descriptive metadata, and verify that it’s fully indexed and ready for retrieval.

Pipeline at a Glance

  1. Publisher setup (Studio): create the publisher that owns your content (covered below).
  2. Ingest content with metadata: upload files and enrich them (covered below, with a deep dive in Upload Files to Data Engine).
  3. Verify indexing: poll item status until your content is searchable (covered below).
  4. Semantic search: query your content (deep dive: Building Custom Search).
  5. Grounded completions with sources: answer questions from your content with citations (deep dive: Grounded Completions with RAG).
  6. Content lifecycle: update, bulk-edit, and delete content (Part 2).
  7. Verification, errors & resilience: error handling and retry patterns (Part 3).

Prerequisites

Before starting, ensure you have:
All API calls in this series use Bearer token authentication via the OAuth2 client credentials flow. The snippets below include a minimal token fetch; see the Authentication Tutorial for token caching and expiration handling.
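The caching idea mentioned above can be sketched as a small wrapper that reuses a token until shortly before it expires. This is a minimal illustration, not the cookbook's implementation: `TokenCache`, `fetch_token`, and `safety_margin_seconds` are hypothetical names, and the `expires_in` field is assumed from the standard OAuth2 token response (the sketch falls back to the one-hour lifetime noted under Troubleshooting).

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Caches an access token and refetches it shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], dict],
        clock: Callable[[], float] = time.time,
        safety_margin_seconds: float = 60.0,
    ) -> None:
        self._fetch_token = fetch_token  # returns the parsed token response
        self._clock = clock              # injectable so the cache can be tested
        self._margin = safety_margin_seconds
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refetch when no token is cached or the cached one is about to expire.
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            payload = self._fetch_token()
            self._token = payload["access_token"]
            # Assumption: the response carries "expires_in" in seconds;
            # otherwise fall back to one hour.
            self._expires_at = self._clock() + float(payload.get("expires_in", 3600))
        return self._token
```

To use it, pass a function that performs the token request from Step 2 and returns the parsed JSON, then call `cache.get()` wherever a snippet uses `token`.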

Step 1: Create Your Publisher

Content in the Data Engine belongs to a publisher. Create one in Gloo AI Studio:
  1. In Gloo AI Studio, click your user account in the bottom-left corner and select Manage Organizations
  2. Select the organization you want to add the publisher to, then click View Publishers
  3. Click Create Publisher and give the new publisher a name
  4. Copy the Publisher ID (a UUID) — every API call in this series uses it
See Manage Publishers for the full Studio walkthrough.
Use one dedicated publisher for this series. Parts 2 and 3 operate on the content you upload here, and a dedicated publisher keeps those operations cleanly separated from your production content.
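Because the Publisher ID is a UUID, a script can catch a placeholder or mistyped value before it makes any API call. The helper below is a small sketch using Python's standard `uuid` module; `is_canonical_uuid` is a hypothetical name, and the check only confirms the format, not that the publisher exists.

```python
import uuid


def is_canonical_uuid(value: str) -> bool:
    """Return True if value is a UUID in hyphenated 8-4-4-4-12 form."""
    try:
        # uuid.UUID also accepts other spellings (no hyphens, braces), so
        # compare against the canonical string to require the hyphenated form.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False
```

For example, `is_canonical_uuid("your_publisher_id")` returns `False`, which makes a forgotten placeholder fail early instead of surfacing later as an API error.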

Step 2: Upload Content with a Producer ID

Upload a file with POST /ingestion/v2/files. The producer_id query parameter attaches your own stable identifier to the item, which is what makes the pipeline manageable later: re-running the upload detects a duplicate instead of creating a copy, and in Part 2 you’ll update, bulk-edit, and delete these same items.
This series uses a short Markdown article as its sample content: grab building-stronger-communities.md from the cookbook repository and save it next to your script. (Any Markdown, text, PDF, or Word file of your own works too; just adjust the filename.) Then upload it:
import requests

CLIENT_ID = "your_client_id"
CLIENT_SECRET = "your_client_secret"
PUBLISHER_ID = "your_publisher_id"
PRODUCER_ID = "rag-pipeline-part1-building-stronger-communities"

# Get an access token (see the Authentication tutorial)
token_response = requests.post(
    "https://platform.ai.gloo.com/oauth2/token",
    data={"grant_type": "client_credentials", "scope": "api/access"},
    auth=(CLIENT_ID, CLIENT_SECRET),
)
token_response.raise_for_status()  # fail fast on bad credentials
token = token_response.json()["access_token"]

# Upload the file with a stable producer ID
with open("building-stronger-communities.md", "rb") as f:
    response = requests.post(
        "https://platform.ai.gloo.com/ingestion/v2/files",
        headers={"Authorization": f"Bearer {token}"},
        params={"producer_id": PRODUCER_ID},
        files={"files": ("building-stronger-communities.md", f)},
        data={"publisher_id": PUBLISHER_ID},
    )
response.raise_for_status()  # surface 4xx/5xx errors (see Troubleshooting)
result = response.json()

# A fresh upload returns the new item ID in "ingesting";
# re-uploading the same content returns it in "duplicates".
item_id = (result["ingesting"] or result["duplicates"])[0]
print(f"Item ID: {item_id}")

What You’ll See

Item ID: 822308cc-72fc-478d-a2eb-fbdf01a6a15d
Keep this item ID — the next two steps use it, and Parts 2 and 3 operate on this same item.
For multi-file uploads, batch processing, and supported file types, see the Upload Files to Data Engine deep dive.
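The upload snippet reads the item ID from either "ingesting" or "duplicates". The same logic can be sketched as a helper that also reports which case occurred and fails clearly when neither list has an entry. `extract_item_id` is a hypothetical name, and the sketch assumes both keys hold lists of ID strings, as the indexing in the snippet implies.

```python
from typing import Tuple


def extract_item_id(result: dict) -> Tuple[str, bool]:
    """Return (item_id, is_duplicate) from an upload response body."""
    # Assumption: both keys, when present, hold lists of item ID strings.
    ingesting = result.get("ingesting") or []
    duplicates = result.get("duplicates") or []
    if ingesting:
        return ingesting[0], False  # fresh upload, queued for ingestion
    if duplicates:
        return duplicates[0], True  # content was already uploaded
    raise ValueError(f"Upload response contained no item IDs: {result!r}")
```

Knowing whether the item was a duplicate is useful when re-running a script: a duplicate has usually been indexed already, so the first poll in Step 4 is likely to return COMPLETED.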

Step 3: Enrich with Metadata

Attach descriptive metadata with PATCH /engine/v2/item. Good metadata pays off downstream: titles and authors appear in search results and source citations, and tags let you organize and bulk-manage content in Part 2.
metadata = {
    "publisher_id": PUBLISHER_ID,
    "item_id": item_id,
    "item_title": "Building Stronger Communities Through Service",
    "item_summary": "Practical guidance for starting and sustaining community service efforts.",
    "author": ["Gloo AI Docs Team"],
    "item_tags": ["community", "service", "rag-pipeline-series"],
}
response = requests.patch(
    "https://platform.ai.gloo.com/engine/v2/item",
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=metadata,
)
response.raise_for_status()
print("Metadata set")
Metadata can be set as soon as the item ID exists — you don’t need to wait for ingestion to finish. You can also target a single item by producer_id instead of item_id.
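Targeting by producer_id can be sketched as a payload builder that accepts exactly one of the two identifiers. Treat this as an assumption-laden illustration: `build_metadata_payload` is a hypothetical helper, and sending `producer_id` as a body field alongside `publisher_id` is assumed here rather than confirmed by this tutorial, so check the API reference before relying on it.

```python
from typing import Any, Dict, Optional


def build_metadata_payload(
    publisher_id: str,
    *,
    item_id: Optional[str] = None,
    producer_id: Optional[str] = None,
    **fields: Any,
) -> Dict[str, Any]:
    """Build a PATCH /engine/v2/item body addressed by one identifier."""
    # Exactly one identifier must be given, otherwise the target is ambiguous.
    if (item_id is None) == (producer_id is None):
        raise ValueError("Provide exactly one of item_id or producer_id")
    payload: Dict[str, Any] = {"publisher_id": publisher_id}
    if item_id is not None:
        payload["item_id"] = item_id
    else:
        # Assumption: the body field is named "producer_id".
        payload["producer_id"] = producer_id
    # Leave out fields that were not set so they are not sent at all.
    payload.update({k: v for k, v in fields.items() if v is not None})
    return payload
```

The result can be passed as `json=` to the same `requests.patch` call shown above, for example `build_metadata_payload(PUBLISHER_ID, producer_id=PRODUCER_ID, item_title="...")`.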

Step 4: Verify Indexing

Ingestion is asynchronous: the upload response means your file is queued, not searchable. Poll GET /engine/v2/items/{item_id} until status reaches COMPLETED — typically a few minutes for a small file. While processing you’ll see intermediate states such as CHUNKING.
import time

POLL_INTERVAL_SECONDS = 15
POLL_TIMEOUT_SECONDS = 600

deadline = time.time() + POLL_TIMEOUT_SECONDS
while time.time() < deadline:
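    response = requests.get(
        f"https://platform.ai.gloo.com/engine/v2/items/{item_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()  # stop on auth or lookup errors instead of polling blindly
    item = response.json()
    status = item.get("status", "unknown")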
    print(f"Status: {status}")

    if status.upper() == "COMPLETED":
        print(f"Indexed: {item['item_title']} (tags: {', '.join(item['item_tags'])})")
        break
    if status.upper() in ("FAILED", "ERROR"):
        raise RuntimeError(f"Ingestion failed with status: {status}")

    time.sleep(POLL_INTERVAL_SECONDS)
else:
    raise TimeoutError(f"Not indexed within {POLL_TIMEOUT_SECONDS}s")

What You’ll See

For a fresh upload (about 6 minutes for the sample file):
Status: QUEUED
Status: CHUNKING
...
Status: COMPLETED
Indexed: Building Stronger Communities Through Service (tags: community, service, rag-pipeline-series)
If you re-run against already-indexed content, the first poll returns COMPLETED immediately. The metadata you set in Step 3 round-trips on the same response — confirming title, author, and tags are attached to the indexed item.
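The polling loop can also be packaged as a reusable function so that other scripts (and tests) can call it. The sketch below keeps the same statuses as the loop above (COMPLETED as success, FAILED and ERROR as failures) and takes the status lookup, sleep, and clock as parameters. `wait_until_indexed` and its parameter names are hypothetical, not part of any Gloo AI SDK.

```python
import time
from typing import Callable

FAILURE_STATUSES = {"FAILED", "ERROR"}


def wait_until_indexed(
    fetch_item: Callable[[], dict],
    timeout_seconds: float = 600.0,
    interval_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll fetch_item() until the item is COMPLETED, failed, or out of time."""
    deadline = clock() + timeout_seconds
    while True:
        item = fetch_item()
        status = str(item.get("status", "unknown")).upper()
        if status == "COMPLETED":
            return item
        if status in FAILURE_STATUSES:
            raise RuntimeError(f"Ingestion failed with status: {status}")
        # Give up if another full interval would run past the deadline.
        if clock() + interval_seconds > deadline:
            raise TimeoutError(f"Not indexed within {timeout_seconds}s")
        sleep(interval_seconds)
```

Here `fetch_item` would wrap the GET request from the loop above and return its parsed JSON. Injecting `sleep` and `clock` lets the function be exercised in tests without real waiting.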

Run the Complete Example

The cookbook contains the full pipeline — upload, metadata, and polling with token caching and error handling — as one runnable program in all six languages. From the cookbook repository, install dependencies, copy .env.example to .env and add your credentials, then run it:
cd rag-pipeline-part-1/python
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env       # then add your Client ID, Secret, and Publisher ID
python main.py
You’ll see:
Step 1: Uploading sample content...
  Queued for ingestion: 822308cc-72fc-478d-a2eb-fbdf01a6a15d

Step 2: Setting item metadata...
  Metadata set: title, summary, author, 3 tags

Step 3: Verifying indexing (polling)...
  Status: QUEUED
  Status: COMPLETED

Pipeline content is indexed and ready.
  Item ID:  822308cc-72fc-478d-a2eb-fbdf01a6a15d
  Title:    Building Stronger Communities Through Service
  Author:   Gloo AI Docs Team
  Tags:     community, service, rag-pipeline-series
  Status:   COMPLETED

Working Code Sample

View Complete Code

Clone or browse the complete working examples for all 6 languages (JavaScript, TypeScript, Python, PHP, Go, Java) with setup instructions and the sample content file.
The code snippets above are simplified and self-contained — designed for readability and easy copy-paste. The cookbook examples add token caching, duplicate handling, and structured error handling. Both implement the same APIs and patterns.

Troubleshooting

Error: 401 Unauthorized

Your access token is missing, expired, or malformed. Tokens expire after one hour — re-run the token request. See the Authentication Tutorial.
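For long-running scripts, one way to handle an expired token is to retry a request once with a fresh token when the API answers 401. The sketch below is illustrative only: `send_with_refresh`, `send`, and `fetch_token` are hypothetical names, and the response object only needs a `status_code` attribute, as `requests.Response` has.

```python
from typing import Any, Callable, Tuple


def send_with_refresh(
    send: Callable[[str], Any],
    fetch_token: Callable[[], str],
    token: str,
) -> Tuple[Any, str]:
    """Call send(token); on a 401, fetch a new token and retry exactly once."""
    response = send(token)
    if response.status_code == 401:
        token = fetch_token()   # re-run the client credentials request
        response = send(token)  # a second 401 is returned to the caller
    return response, token
```

The function returns the token it ended up using so the caller can keep it for later requests. Retrying only once avoids looping forever when the credentials themselves are wrong.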

Error: 400 publisher_not_found

The publisher_id doesn’t exist or isn’t accessible with your credentials. Copy the Publisher ID (a UUID) from Studio > Data Engine > Publishers.

Error: 400 too_many_files

producer_id applies to a single file. When uploading multiple files in one request, omit it.
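This rule can be enforced client-side before the request is sent. The helper below is a small sketch: `build_upload_params` is a hypothetical name, and it simply refuses to combine a producer_id with more than one file.

```python
from typing import Dict, Optional, Sequence


def build_upload_params(
    filenames: Sequence[str], producer_id: Optional[str] = None
) -> Dict[str, str]:
    """Return query parameters for an upload, rejecting invalid combinations."""
    if not filenames:
        raise ValueError("At least one file is required")
    if producer_id is None:
        return {}  # multi-file or anonymous upload: no producer_id parameter
    if len(filenames) > 1:
        raise ValueError("producer_id applies to a single file; omit it for multi-file uploads")
    return {"producer_id": producer_id}
```

The returned dictionary is meant to be passed as `params=` to the upload request, so a mistaken combination fails locally with a clear message instead of a 400 from the API.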

Polling times out

Larger files take longer to process. Increase the timeout, or check Studio > Ingestion Analytics to see whether the publisher is still processing.
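When raising the timeout for large files, a fixed 15-second interval produces many polls. One common alternative, which is not something this tutorial or the cookbook prescribes, is to lengthen the wait between polls geometrically up to a cap. The generator below is a sketch of that schedule; `backoff_intervals` and all of its default values are illustrative assumptions.

```python
from typing import Iterator


def backoff_intervals(
    initial: float = 15.0,
    factor: float = 1.5,
    cap: float = 120.0,
    budget: float = 1800.0,
) -> Iterator[float]:
    """Yield sleep durations that grow by factor, stay under cap, and sum to budget."""
    waited = 0.0
    delay = initial
    while waited < budget:
        # Never exceed the cap or the remaining time budget.
        step = min(delay, cap, budget - waited)
        yield step
        waited += step
        delay *= factor
```

A polling loop would check the status, then sleep for the next yielded value, and treat exhaustion of the generator as a timeout.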

Next Steps

Your pipeline now has indexed, metadata-rich content. Wire it into retrieval:
  1. Building Custom Search — query this content semantically with the Search API
  2. Grounded Completions with RAG — answer questions from this content with source citations
  3. Part 2: Content Lifecycle — update, bulk-edit, and delete the items you created here
  4. [Part 3: Verification, Error Handling & Resilience](/tutorials/rag-pipeline-part-3) — production-grade retry and verification patterns