Part 1: Set Up the Pipeline

This is Part 1 of the Build an End-to-End RAG Pipeline series. Across three parts, you’ll connect Gloo AI’s features into one continuous workflow: the publisher and content you set up here are the same ones you’ll manage in Part 2 and harden in Part 3. In this part you’ll create a publisher, upload content with descriptive metadata, and verify that it’s fully indexed and ready for retrieval.

Pipeline at a Glance

Publisher setup (Studio)

Create the publisher that owns your content — covered below.

Ingest content with metadata

Upload files and enrich them — covered below, with a deep dive in Upload Files to Data Engine.

Verify indexing

Poll item status until your content is searchable — covered below.

Semantic search

Query your content — deep dive: Building Custom Search.

Grounded completions with sources

Answer questions from your content with citations — deep dive: Grounded Completions with RAG.

Content lifecycle

Update, bulk-edit, and delete content — Part 2.

Verification, errors & resilience

Error handling and retry patterns — Part 3.

Prerequisites

Before starting, ensure you have:

A Gloo AI Studio account
Your Client ID and Client Secret from the API Credentials page
Authentication set up: complete the Authentication Tutorial first

All API calls in this series use Bearer token authentication via the OAuth2 client credentials flow. The snippets below include a minimal token fetch; see the Authentication Tutorial for token caching and expiration handling.
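The token caching mentioned above could be sketched as follows. This is a minimal illustration, not the cookbook's implementation: `TokenCache`, `fetch`, and `skew_seconds` are hypothetical names, and it assumes the token response carries the standard OAuth2 `expires_in` field (in seconds) alongside `access_token`.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Caches an access token and refetches it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        skew_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # hypothetical: returns (access_token, expires_in)
        self._skew = skew_seconds  # refresh this many seconds before expiry
        self._clock = clock        # injectable so the cache can be tested
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if it is missing or stale."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            token, expires_in = self._fetch()
            self._token = token
            self._expires_at = now + expires_in - self._skew
        return self._token
```

In use, `fetch` would wrap the token request and return `(body["access_token"], body["expires_in"])`, and each API call would build its header from `cache.get()` rather than from a token fetched once at startup.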

Step 1: Create Your Publisher

Content in the Data Engine belongs to a publisher. Create one in Gloo AI Studio:

In Gloo AI Studio, click your user account in the bottom-left corner and select Manage Organizations
Select the organization you want to add the publisher to, then click View Publishers
Click Create Publisher and give the new publisher a name
Copy the Publisher ID (a UUID) — every API call in this series uses it

See Manage Publishers for the full Studio walkthrough.

Use one dedicated publisher for this series. Parts 2 and 3 operate on the content you upload here, and a dedicated publisher keeps those operations cleanly separated from your production content.

Step 2: Upload Content with a Producer ID

Upload a file to POST /ingestion/v2/files. The producer_id query parameter attaches your own stable identifier to the item, which is what makes the pipeline manageable later: re-running the upload detects a duplicate instead of creating a copy, and in Part 2 you’ll update, bulk-edit, and delete these same items.

This series uses a short Markdown article as its sample content: grab building-stronger-communities.md from the cookbook repository and save it next to your script. (Any Markdown, text, PDF, or Word file of your own works too; just adjust the filename.) Then upload it:

import requests

CLIENT_ID = "your_client_id"
CLIENT_SECRET = "your_client_secret"
PUBLISHER_ID = "your_publisher_id"
PRODUCER_ID = "rag-pipeline-part1-building-stronger-communities"

# Get an access token (see the Authentication tutorial)
token_response = requests.post(
    "https://platform.ai.gloo.com/oauth2/token",
    data={"grant_type": "client_credentials", "scope": "api/access"},
    auth=(CLIENT_ID, CLIENT_SECRET),
)
token_response.raise_for_status()  # fail fast on bad credentials
token = token_response.json()["access_token"]

# Upload the file with a stable producer ID
with open("building-stronger-communities.md", "rb") as f:
    response = requests.post(
        "https://platform.ai.gloo.com/ingestion/v2/files",
        headers={"Authorization": f"Bearer {token}"},
        params={"producer_id": PRODUCER_ID},
        files={"files": ("building-stronger-communities.md", f)},
        data={"publisher_id": PUBLISHER_ID},
    )
response.raise_for_status()  # surface 4xx/5xx here instead of as a KeyError below
result = response.json()

# A fresh upload returns the new item ID in "ingesting";
# re-uploading the same content returns it in "duplicates".
item_id = (result["ingesting"] or result["duplicates"])[0]
print(f"Item ID: {item_id}")

What You’ll See

Item ID: 822308cc-72fc-478d-a2eb-fbdf01a6a15d

Keep this item ID — the next two steps use it, and Parts 2 and 3 operate on this same item.
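If you want the script to tell a fresh upload from a re-run, the response handling can be pulled into a small helper. The sketch below assumes, as the upload snippet does, that "ingesting" and "duplicates" are lists of item ID strings; `extract_item_id` is a hypothetical name, not part of the API.

```python
from typing import Any, Dict, Tuple


def extract_item_id(result: Dict[str, Any]) -> Tuple[str, bool]:
    """Return (item_id, is_duplicate) from an upload response body."""
    ingesting = result.get("ingesting") or []
    duplicates = result.get("duplicates") or []
    if ingesting:
        # Fresh upload: the item was queued for ingestion.
        return ingesting[0], False
    if duplicates:
        # Re-upload of the same content: the existing item ID comes back.
        return duplicates[0], True
    raise ValueError(f"Upload response contained no item ID: {result!r}")
```

Calling `item_id, is_duplicate = extract_item_id(result)` gives the same ID as the one-line expression in the upload snippet, and raises a readable error instead of an IndexError when both lists are empty.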

For multi-file uploads, batch processing, and supported file types, see the Upload Files to Data Engine deep dive.

Step 3: Enrich with Metadata

Attach descriptive metadata with PATCH /engine/v2/item. Good metadata pays off downstream: titles and authors appear in search results and source citations, and tags let you organize and bulk-manage content in Part 2.

metadata = {
    "publisher_id": PUBLISHER_ID,
    "item_id": item_id,
    "item_title": "Building Stronger Communities Through Service",
    "item_summary": "Practical guidance for starting and sustaining community service efforts.",
    "author": ["Gloo AI Docs Team"],
    "item_tags": ["community", "service", "rag-pipeline-series"],
}
response = requests.patch(
    "https://platform.ai.gloo.com/engine/v2/item",
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=metadata,
)
response.raise_for_status()
print("Metadata set")

Metadata can be set as soon as the item ID exists — you don’t need to wait for ingestion to finish. You can also target a single item by producer_id instead of item_id.
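A small payload builder can make the choice between the two identifiers explicit. This is a sketch under one labeled assumption: that the request body field for targeting by producer ID is named "producer_id", mirroring the upload query parameter. Check the API reference before relying on it. `build_metadata_payload` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional


def build_metadata_payload(
    publisher_id: str,
    *,
    item_id: Optional[str] = None,
    producer_id: Optional[str] = None,
    **fields: Any,
) -> Dict[str, Any]:
    """Build a PATCH body that targets an item by exactly one identifier."""
    if (item_id is None) == (producer_id is None):
        raise ValueError("Pass exactly one of item_id or producer_id")
    payload: Dict[str, Any] = {"publisher_id": publisher_id}
    if item_id is not None:
        payload["item_id"] = item_id
    else:
        # Assumption: the body field is called "producer_id".
        payload["producer_id"] = producer_id
    # Skip unset fields so the request only carries values you mean to set.
    payload.update({k: v for k, v in fields.items() if v is not None})
    return payload
```

For example, `build_metadata_payload(PUBLISHER_ID, producer_id=PRODUCER_ID, item_title="Building Stronger Communities Through Service")` produces a body you can pass as `json=` to the same PATCH call.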

Step 4: Verify Indexing

Ingestion is asynchronous: the upload response means your file is queued, not searchable. Poll GET /engine/v2/items/{item_id} until status reaches COMPLETED — typically a few minutes for a small file. While processing you’ll see intermediate states such as CHUNKING.

import time

POLL_INTERVAL_SECONDS = 15
POLL_TIMEOUT_SECONDS = 600

deadline = time.time() + POLL_TIMEOUT_SECONDS
while time.time() < deadline:
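    poll_response = requests.get(
        f"https://platform.ai.gloo.com/engine/v2/items/{item_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    poll_response.raise_for_status()  # stop on HTTP errors such as 401 or 404
    item = poll_response.json()
    status = item.get("status", "unknown")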
    print(f"Status: {status}")

    if status.upper() == "COMPLETED":
        print(f"Indexed: {item['item_title']} (tags: {', '.join(item['item_tags'])})")
        break
    if status.upper() in ("FAILED", "ERROR"):
        raise RuntimeError(f"Ingestion failed with status: {status}")

    time.sleep(POLL_INTERVAL_SECONDS)
else:
    raise TimeoutError(f"Not indexed within {POLL_TIMEOUT_SECONDS}s")

What You’ll See

For a fresh upload (about 6 minutes for the sample file):

Status: QUEUED
Status: CHUNKING
...
Status: COMPLETED
Indexed: Building Stronger Communities Through Service (tags: community, service, rag-pipeline-series)

If you re-run against already-indexed content, the first poll returns COMPLETED immediately. The metadata you set in Step 3 round-trips on the same response — confirming title, author, and tags are attached to the indexed item.
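The polling loop can also be written as a reusable function that is easy to test without waiting on a real ingestion. In this sketch, `wait_until_indexed` and its parameters are hypothetical names; `fetch_status` stands in for the GET request and is expected to return the item's status string. The terminal states are the same ones the snippet above checks for.

```python
import time
from typing import Callable

TERMINAL_FAILURES = {"FAILED", "ERROR"}


def wait_until_indexed(
    fetch_status: Callable[[], str],
    interval_seconds: float = 15.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll until the status is COMPLETED; return the number of polls made."""
    deadline = clock() + timeout_seconds
    polls = 0
    while clock() < deadline:
        polls += 1
        status = fetch_status().upper()
        if status == "COMPLETED":
            return polls
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"Ingestion failed with status: {status}")
        sleep(interval_seconds)
    raise TimeoutError(f"Not indexed within {timeout_seconds}s after {polls} polls")
```

Using `time.monotonic` for the deadline keeps the timeout correct even if the system clock is adjusted mid-run, and the injectable `sleep` and `clock` let a test step through QUEUED, CHUNKING, and COMPLETED instantly.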

Run the Complete Example

The cookbook contains the full pipeline — upload, metadata, and polling with token caching and error handling — as one runnable program in all six languages. From the cookbook repository, install dependencies, copy .env.example to .env and add your credentials, then run it:

cd rag-pipeline-part-1/python
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env       # then add your Client ID, Secret, and Publisher ID
python main.py

You’ll see:

Step 1: Uploading sample content...
  Queued for ingestion: 822308cc-72fc-478d-a2eb-fbdf01a6a15d

Step 2: Setting item metadata...
  Metadata set: title, summary, author, 3 tags

Step 3: Verifying indexing (polling)...
  Status: QUEUED
  Status: COMPLETED

Pipeline content is indexed and ready.
  Item ID:  822308cc-72fc-478d-a2eb-fbdf01a6a15d
  Title:    Building Stronger Communities Through Service
  Author:   Gloo AI Docs Team
  Tags:     community, service, rag-pipeline-series
  Status:   COMPLETED

Working Code Sample

View Complete Code

Clone or browse the complete working examples for all 6 languages (JavaScript, TypeScript, Python, PHP, Go, Java) with setup instructions and the sample content file.

The code snippets above are simplified and self-contained — designed for readability and easy copy-paste. The cookbook examples add token caching, duplicate handling, and structured error handling. Both implement the same APIs and patterns.

Troubleshooting

Error: 401 Unauthorized

Your access token is missing, expired, or malformed. Tokens expire after one hour — re-run the token request. See the Authentication Tutorial.
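One way to handle an expired token in a long-running script is to retry once with a fresh token. The sketch below is illustrative only: `send_with_refresh`, `send`, `get_token`, and `refresh_token` are hypothetical names, and `send` is assumed to perform the request and return a `(status_code, body)` pair.

```python
from typing import Any, Callable, Tuple


def send_with_refresh(
    send: Callable[[str], Tuple[int, Any]],
    get_token: Callable[[], str],
    refresh_token: Callable[[], str],
) -> Tuple[int, Any]:
    """Call send(token); on a 401, fetch a new token and retry exactly once."""
    status, body = send(get_token())
    if status == 401:
        # The token was likely expired or revoked; try once more with a new one.
        status, body = send(refresh_token())
    return status, body
```

Retrying only once is deliberate: if the second attempt also returns 401, the credentials themselves are probably wrong, and looping would only hide that.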

Error: 400 publisher_not_found

The publisher_id doesn’t exist or isn’t accessible with your credentials. Copy the Publisher ID (a UUID) from Studio > Data Engine > Publishers.

Error: 400 too_many_files

producer_id applies to a single file. When uploading multiple files in one request, omit it.
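A client-side guard can catch this before the request is sent. The helper below is a hypothetical sketch (`upload_query_params` is not part of any SDK): it returns the query parameters for an upload and refuses to combine a producer ID with more than one file.

```python
from typing import Dict, Optional, Sequence


def upload_query_params(
    paths: Sequence[str], producer_id: Optional[str] = None
) -> Dict[str, str]:
    """Return query params for an upload, allowing producer_id for one file only."""
    if not paths:
        raise ValueError("At least one file is required")
    if producer_id is None:
        return {}
    if len(paths) > 1:
        # The API would reject this request with 400 too_many_files.
        raise ValueError(
            f"producer_id applies to a single file, but {len(paths)} were given"
        )
    return {"producer_id": producer_id}
```

Raising here, rather than silently dropping the producer ID, keeps a batch upload from quietly losing the identifier you meant to attach.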

Polling times out

Larger files take longer to process. Increase the timeout, or check Studio > Ingestion Analytics to see whether the publisher is still processing.

Next Steps

Your pipeline now has indexed, metadata-rich content. Wire it into retrieval:

Building Custom Search — query this content semantically with the Search API
Grounded Completions with RAG — answer questions from this content with source citations
Part 2: Content Lifecycle — update, bulk-edit, and delete the items you created here
Part 3: Verification, Error Handling & Resilience — production-grade retry and verification patterns
