Harden your Gloo AI integration: interpret API errors, retry transient failures with backoff, and verify ingestion health.
This is Part 3 of the Build an End-to-End RAG Pipeline series. Parts 1 and 2 walked the happy path. Production code can’t assume it: requests fail, services blip, and you need to confirm that an operation actually took effect. This part builds a small resilient client and uses it to interpret API errors, retry transient failures, and verify ingestion health.
Gloo AI does not include a monitoring or health-check endpoint. Resilience is built from the same item APIs you’ve already used, plus disciplined error handling on the client side.
A resilient client does two things on every request: it turns failures into a normalized error (status, code, message) so callers can react to them, and it retries only transient failures, meaning server errors (500, 502, 503, 504) and network blips, with exponential backoff. Client errors (400, 401, 403, 404, 422) are bugs in the request, not blips, so they fail fast.

The Data Engine returns a few error shapes, namely {"detail": {"code", "message"}}, {"detail": "..."}, and {"error", "message"}, so the parser normalizes all of them.
```python
import time

import requests

RETRYABLE = {500, 502, 503, 504}  # transient server errors worth retrying
MAX_RETRIES = 4


class ApiError(Exception):
    """Normalized API failure: HTTP status, error code, and message."""

    def __init__(self, status, code, message):
        super().__init__(f"[{status} {code}] {message}")
        self.status, self.code, self.message = status, code, message


def parse_error(response):
    """Return (code, message) for any of the known error shapes."""
    try:
        body = response.json()
    except ValueError:
        return None, response.reason
    if not isinstance(body, dict):
        # Valid JSON that is not an object (for example a list): nothing to extract.
        return None, response.reason
    detail = body.get("detail")
    if isinstance(detail, dict):
        return detail.get("code"), detail.get("message") or response.reason
    if isinstance(detail, str):
        return None, detail
    return body.get("error"), body.get("message") or response.reason


def request(method, url, token, **kwargs):
    for attempt in range(MAX_RETRIES + 1):
        response = requests.request(
            method, url,
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
            **kwargs,
        )
        # Retry transient server errors with exponential backoff: 1s, 2s, 4s, 8s.
        if response.status_code in RETRYABLE and attempt < MAX_RETRIES:
            delay = 2 ** attempt
            print(f" Attempt {attempt + 1} failed ({response.status_code}); retrying in {delay}s")
            time.sleep(delay)
            continue
        # Anything else that is not a success fails fast as a normalized error.
        if not response.ok:
            code, message = parse_error(response)
            raise ApiError(response.status_code, code, message)
        return response.json()
```
These snippets are simplified for readability. The cookbook client also refreshes the access token once on a 401, retries network-level failures, and applies the same retry policy to multipart uploads. Both implement the same patterns.
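To see the normalization without calling the API, the same branching as parse_error can be run against already-decoded payloads. This is a minimal sketch: normalize_error is a hypothetical helper that takes the decoded body and the HTTP reason phrase directly, and the sample payloads are illustrative values, not captured responses.

```python
def normalize_error(body, reason):
    """Return (code, message) for any of the three documented error shapes.

    body is the decoded JSON payload (or None if decoding failed);
    reason is the HTTP reason phrase, used as the fallback message.
    """
    if not isinstance(body, dict):
        return None, reason
    detail = body.get("detail")
    if isinstance(detail, dict):
        return detail.get("code"), detail.get("message") or reason
    if isinstance(detail, str):
        return None, detail
    return body.get("error"), body.get("message") or reason


# Illustrative payloads only; one per shape, plus an undecodable body.
samples = [
    ({"detail": {"code": "Item not found", "message": "No such item"}}, "Not Found"),
    ({"detail": "Item ID must be a valid UUID format"}, "Bad Request"),
    ({"error": "Forbidden - insufficient permissions"}, "Forbidden"),
    (None, "Bad Gateway"),
]
results = [normalize_error(body, reason) for body, reason in samples]
for code, message in results:
    print(f"code={code!r} message={message!r}")
```

Every shape collapses to the same (code, message) pair, and a payload with a code but no message picks up the reason phrase, so callers never have to inspect the raw body.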
With request and parse_error in place, error handling becomes uniform: catch the normalized error and read its status, code, and message. The calls below deliberately trigger three common failures — a missing item (404), a malformed ID (400), and a rejected token (403).
```
Missing item (random UUID): status=404 code='Item not found' message='The requested item does not exist or has been permanently deleted'
Malformed item ID: status=400 code='Invalid item ID format' message='Item ID must be a valid UUID format'
Rejected bearer token: status=403 code='Forbidden - insufficient permissions' message='Forbidden'
```
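One way to produce report lines in this format is a small wrapper that catches the normalized error. The sketch below is an assumption about how such a wrapper could look, not the cookbook's code: describe is a hypothetical helper, and fake_missing_item stands in for a real call to the Step 1 request function so the example runs offline.

```python
class ApiError(Exception):
    """Same normalized error as in Step 1, repeated so this sketch runs alone."""

    def __init__(self, status, code, message):
        super().__init__(f"[{status} {code}] {message}")
        self.status, self.code, self.message = status, code, message


def describe(label, call):
    """Run call() and return one report line, whether it succeeds or fails."""
    try:
        call()
        return f"{label}: ok"
    except ApiError as e:
        return f"{label}: status={e.status} code={e.code!r} message={e.message!r}"


def fake_missing_item():
    # Stand-in for a GET on a random UUID through the resilient client.
    raise ApiError(
        404,
        "Item not found",
        "The requested item does not exist or has been permanently deleted",
    )


line = describe("Missing item (random UUID)", fake_missing_item)
print(line)
```

Because every failure arrives as an ApiError, the wrapper needs a single except clause regardless of which error shape the server used.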
A rejected or malformed token returns 403, not 401. A genuinely expired token typically returns 401 — which is why the cookbook client refreshes the token once on a 401 and retries. It does not refresh on 403, since 403 can be a legitimate permission denial (for example, an item that belongs to another publisher).
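The refresh-once rule can be sketched as a thin wrapper around any call that takes a token. Everything here is illustrative: call_with_refresh, send, and refresh_token are hypothetical names, the fake senders simulate the server, and the error codes in them are invented for the demo. The cookbook client may structure this differently.

```python
class ApiError(Exception):
    """Same normalized error as in Step 1, repeated so this sketch runs alone."""

    def __init__(self, status, code, message):
        super().__init__(f"[{status} {code}] {message}")
        self.status, self.code, self.message = status, code, message


def call_with_refresh(send, token, refresh_token):
    """Call send(token); on a 401, fetch a new token and retry exactly once."""
    try:
        return send(token)
    except ApiError as e:
        if e.status != 401:
            raise  # 403 and all other errors surface unchanged
    # Only reached after a 401. A second 401 here propagates to the caller.
    return send(refresh_token())


seen = []


def fake_send(token):
    # Simulated endpoint: rejects the expired token, accepts any other.
    seen.append(token)
    if token == "expired-token":
        raise ApiError(401, "unauthorized", "Token expired")
    return {"ok": True}


def fake_forbidden(token):
    # Simulated permission denial, which must not trigger a refresh.
    raise ApiError(403, "forbidden", "Forbidden")


result = call_with_refresh(fake_send, "expired-token", lambda: "fresh-token")
print(result, seen)

refresh_calls = []
try:
    call_with_refresh(fake_forbidden, "any-token", lambda: refresh_calls.append(1) or "x")
except ApiError as e:
    print("403 surfaced without a refresh:", e.status)
```

Limiting the refresh to a single attempt keeps a genuinely invalid credential from looping forever.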
Transient failures — a 503, a dropped connection — should be retried, not surfaced. The retry loop from Step 1 already does this; here it is in isolation, recovering from a service that fails twice before succeeding.
A healthy API won’t return a 5xx on demand, so this example simulates a transient failure to exercise the backoff path. In production the same path handles real 5xx responses and network errors.
```python
# Reuses time, ApiError, RETRYABLE, and MAX_RETRIES from Step 1.
RETRYABLE_DELAYS = [2 ** i for i in range(MAX_RETRIES)]  # 1s, 2s, 4s, 8s
calls = 0


def flaky():
    """Simulated service: fails with a 503 twice, then succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise ApiError(503, "service_unavailable", "Service temporarily unavailable")
    return {"ok": True}


for attempt in range(MAX_RETRIES + 1):
    try:
        flaky()
        break
    except ApiError as e:
        if e.status in RETRYABLE and attempt < MAX_RETRIES:
            delay = RETRYABLE_DELAYS[attempt]
            print(f" Attempt {attempt + 1} failed ({e.status}: {e.code}); retrying in {delay}s")
            time.sleep(delay)
        else:
            raise

print(f" Succeeded after {calls} attempts")
```
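If the uncapped schedule is too slow for a latency-sensitive path, the delays can be clamped to a ceiling. The helper below is a hypothetical sketch (backoff_delays and its base and cap parameters are not part of the cookbook) that generates the same base-2 schedule with an optional cap.

```python
def backoff_delays(max_retries, base=1.0, cap=None):
    """Return the wait before each retry: base * 2**attempt, optionally capped."""
    delays = [base * 2 ** attempt for attempt in range(max_retries)]
    if cap is not None:
        delays = [min(delay, cap) for delay in delays]
    return delays


uncapped = backoff_delays(4)          # same schedule as MAX_RETRIES = 4
capped = backoff_delays(4, cap=3.0)   # no single wait longer than 3 seconds
print(uncapped, sum(uncapped))
print(capped, sum(capped))
```

With four retries the uncapped schedule waits 15 seconds in total, while a 3-second cap brings that down to 9 seconds at the cost of pressing a struggling service sooner.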
There’s no health endpoint, so you verify by checking the items you care about. Upload a batch, wait for indexing, then fetch each item’s status and roll it up into a summary — treating a 404 as “not found” rather than an error so one missing item doesn’t abort the check. The example deliberately includes a random ID to show that path.
```python
import uuid

# item_ids: uploaded and indexed via the resilient client (see the cookbook).
# request and ApiError come from Step 1; ITEMS_URL and token are assumed to be
# set up as in the earlier parts.
to_check = item_ids + [str(uuid.uuid4())]  # the random ID exercises the 404 path
summary = {"completed": 0, "pending": 0, "failed": 0, "not_found": 0}

for item_id in to_check:
    try:
        status = request("GET", f"{ITEMS_URL}/{item_id}", token).get("status", "").upper()
        if status == "COMPLETED":
            summary["completed"] += 1
        elif status in ("FAILED", "ERROR"):
            summary["failed"] += 1
        else:
            summary["pending"] += 1
    except ApiError as e:
        if e.status == 404:
            # A missing item is a result to report, not a reason to abort.
            summary["not_found"] += 1
        else:
            raise

print(f" Health: {summary['completed']} completed, {summary['pending']} pending, "
      f"{summary['failed']} failed, {summary['not_found']} not found")
```
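The "wait for indexing" step before the roll-up can be sketched as a polling loop with a deadline. This is an assumption about how such a wait could work, not the cookbook's implementation: wait_until_indexed and fetch_status are hypothetical names, the terminal statuses mirror the ones the roll-up checks, and "processing" is a made-up in-progress status for the demo.

```python
import time

TERMINAL = {"COMPLETED", "FAILED", "ERROR"}  # statuses the roll-up treats as final


def wait_until_indexed(item_ids, fetch_status, timeout=120.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until no item is pending or the timeout elapses.

    fetch_status(item_id) is a caller-supplied function returning a status
    string. Returns {item_id: STATUS} from the last poll.
    """
    deadline = clock() + timeout
    while True:
        statuses = {item_id: fetch_status(item_id).upper() for item_id in item_ids}
        still_pending = [i for i, s in statuses.items() if s not in TERMINAL]
        if not still_pending or clock() >= deadline:
            return statuses
        sleep(interval)


# Demo with a fake status source: each item completes on its second poll.
polls = {"a": 0, "b": 0}


def fake_fetch_status(item_id):
    polls[item_id] += 1
    return "completed" if polls[item_id] >= 2 else "processing"


final = wait_until_indexed(["a", "b"], fake_fetch_status, sleep=lambda seconds: None)
print(final)
```

Returning the last observed statuses instead of raising on timeout lets the health summary report slow items as pending.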
The cookbook ties it together — a resilient client that handles errors, retries, and a full upload → verify → clean-up health check — in all six languages. From the cookbook repository, install dependencies, copy .env.example to .env and add your credentials, then run it:
```shell
cd rag-pipeline-part-3/python
python3 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env       # then add your Client ID, Secret, and Publisher ID
python main.py
```
You’ll see:
```
Step 1: Resilient client ready (token refresh, error parsing, retry/backoff).
Step 2: Interpreting API error responses...
 Missing item (random UUID): status=404 code='Item not found' message='The requested item does not exist or has been permanently deleted'
 Malformed item ID: status=400 code='Invalid item ID format' message='Item ID must be a valid UUID format'
 Rejected bearer token: status=403 code='Forbidden - insufficient permissions' message='Forbidden'
Step 3: Retrying transient failures with backoff...
 Attempt 1 failed (503: service_unavailable); retrying in 1s
 Attempt 2 failed (503: service_unavailable); retrying in 2s
 Succeeded after 3 attempts
Step 4: Verifying ingestion health...
 Uploading and indexing a batch...
 Waiting for 2 item(s) to finish indexing...
 Health: 2 completed, 0 pending, 0 failed, 1 not found
 Cleaned up 2 item(s)
Done. The resilient client handled errors, retries, and verification end to end.
```
Clone or browse the complete resilient client for all 6 languages (JavaScript, TypeScript, Python, PHP, Go, Java) with setup instructions and the sample content files.
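Check your backoff cap. With base-2 exponential backoff and MAX_RETRIES = 4, the waits are 1s, 2s, 4s, 8s, so a request that keeps failing spends 15s sleeping, plus the time taken by each of its five attempts, before the error surfaces. Lower MAX_RETRIES or cap the delay for latency-sensitive paths.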
Some error responses carry a code but no message. The parser falls back to the HTTP reason phrase (for example, Forbidden) so you always have something to log.
A malformed or rejected token returns 403. Re-check the token; if it’s genuinely expired you’ll typically get a 401, which the cookbook client refreshes automatically. See the Authentication Tutorial.
That completes the Build an End-to-End RAG Pipeline series — you’ve set up a publisher and ingested content (Part 1), managed its lifecycle (Part 2), and made the integration resilient (Part 3). From here: