When you stream Completions V2 responses (stream: true), failures can arrive after the HTTP 200 text/event-stream response has already started. By then your application may have rendered partial output to the user, so a naive “just retry the request” strategy can erase visible text, duplicate it, or silently replace one answer with a different one. This guide covers how to:
  • classify stream-time failures using the structured error object
  • choose between preserving partial output, bounded retry, application-level continuation, and surfacing an interruption
  • apply concrete retry ceilings and backoff
  • avoid duplicating tool-call side effects
  • log the right fields for support and observability
The reference implementations are in Python and TypeScript. Building in Go, Java, PHP, or another language? The language-neutral streaming error object table and decision tree below are the source of truth — port the same logic.

Streaming Failure Phases

A streamed completion can fail in four distinct ways. Handle each differently:
| Phase | What you observe | Handling |
| --- | --- | --- |
| Pre-stream HTTP error | Non-2xx HTTP response before any SSE bytes | Standard HTTP error handling; see Errors |
| Stream error (Signal A) | SSE event with choices[0].finish_reason: "error" and a structured error object | The retry/continuation logic in this guide |
| Transport disconnect | The connection drops with no structured SSE error | Treat as plausibly transient; same decision tree as Signal A |
| Content filter (Signal B) | SSE event with finish_reason: "content_filter" and no error object | Terminal. Surface to the user; never retry or continue |

Two Distinct Stream-Time Signals

Split detection into two signals so your client never misclassifies a content-filter stop as a clean early completion — or as a retryable error:
  • Signal A — structured stream error. choices[0].finish_reason is "error" and the event carries the structured error object described below. This is the only path that retry and continuation logic applies to.
  • Signal B — content-filter termination. choices[0].finish_reason is "content_filter". The event carries no error object. It is terminal and non-retryable: surface it to the user, and never retry or continue automatically.
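As a minimal sketch of that split, the function below maps one already-parsed SSE chunk to Signal A, Signal B, or neither. The name `classify_event` and the string labels are illustrative assumptions, not part of the API; only `choices[0].finish_reason` comes from the guide.

```python
from typing import Literal

# Hypothetical labels for this sketch; the API itself does not define them.
Signal = Literal["signal_a", "signal_b", "none"]


def classify_event(event: dict) -> Signal:
    """Map one parsed SSE chunk to Signal A, Signal B, or neither."""
    choices = event.get("choices") or [{}]
    finish_reason = choices[0].get("finish_reason")
    if finish_reason == "content_filter":
        # Signal B: terminal. Surface it; never retry or continue.
        return "signal_b"
    if finish_reason == "error":
        # Signal A: the structured error object is expected under event["error"].
        return "signal_a"
    # Ordinary delta chunk or a normal finish ("stop", "length", "tool_calls").
    return "none"
```

Checking Signal B first keeps a content-filter stop from ever reaching the retry path, even if a client later widens the Signal A test.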

The Streaming Error Object

A Signal A event includes a structured error object with nine fields, plus a top-level trace_id on the SSE chunk itself. This is the canonical reference — every section below branches on these fields.
{
  "id": "chatcmpl-error-abc123",
  "object": "chat.completion.chunk",
  "created": 1769792268,
  "model": "gloo-anthropic-claude-sonnet-4.6",
  "choices": [
    {
      "delta": { "content": null, "role": null },
      "finish_reason": "error",
      "index": 0
    }
  ],
  "error": {
    "message": "An unexpected provider error occurred mid-stream.",
    "type": "internal_error",
    "code": 3001,
    "name": "INTERNAL_ERROR",
    "category": "provider_error",
    "description": "An unexpected provider error occurred mid-stream.",
    "fault": "provider",
    "retryable": true,
    "trace_id": "trace-id"
  },
  "trace_id": "trace-id"
}
| Field | Branch on it? | Notes |
| --- | --- | --- |
| error.retryable | Yes (primary retry signal) | Whether retrying the same request may succeed |
| error.fault | Yes (secondary) | Responsible party: client, provider, or internal |
| error.code | Yes (classification after retryable/fault) | Numeric internal code, such as 3001 |
| error.name | Yes (classification) | Symbolic name, such as INTERNAL_ERROR |
| error.type | No (observability/display only) | Stable wire-format error type |
| error.category | No (observability/display only) | client_error, provider_error, or platform_error |
| error.description | No (display only) | Stable explanation of the error code |
| error.message | No (display only) | Request-specific customer-facing message |
| error.trace_id | No (observability) | See trace ID precedence |
| top-level trace_id | No (observability) | Present on every SSE chunk; see trace ID precedence |

Read the Whole Error Object

The same error.code can represent different situations, so never branch on code alone. 3001 / INTERNAL_ERROR is the clearest example:
  • A retryable stream-time provider anomaly surfaces as code: 3001, name: "INTERNAL_ERROR", fault: "provider", retryable: true.
  • A platform-internal 3001 surfaces with fault: "internal" and retryable: false.
Branch on error.retryable first, then use fault, code, and name for classification and observability:
  1. error.retryable — should this request be retried at all?
  2. error.fault — was it a provider issue or internal to the platform?
  3. error.code / error.name — what exactly happened (for logs, metrics, alerts)?
Treating every 3001 / INTERNAL_ERROR as non-retryable will drop recoverable streams; treating every 3001 as retryable will loop on genuine platform errors. Read error.retryable.
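The ordering above can be sketched as a small triage helper. The function name `triage_stream_error` and the label format are assumptions made for illustration; the field names are the ones in the streaming error object.

```python
def triage_stream_error(error: dict) -> tuple[bool, str]:
    """Return (should_retry, metric_label) for one streaming error object.

    The retry decision reads only `retryable`; fault, code, and name are used
    to build a label for logs and metrics, never to decide on their own.
    """
    should_retry = error.get("retryable") is True
    label = "{fault}:{code}:{name}".format(
        fault=error.get("fault", "unknown"),
        code=error.get("code", "unknown"),
        name=error.get("name", "unknown"),
    )
    return should_retry, label


# The two 3001 cases from the text share a code but get different decisions.
provider_3001 = {"code": 3001, "name": "INTERNAL_ERROR", "fault": "provider", "retryable": True}
platform_3001 = {"code": 3001, "name": "INTERNAL_ERROR", "fault": "internal", "retryable": False}
print(triage_stream_error(provider_3001))  # (True, 'provider:3001:INTERNAL_ERROR')
print(triage_stream_error(platform_3001))  # (False, 'internal:3001:INTERNAL_ERROR')
```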

Decision Tree

Two independent axes govern recovery — keep them separate so “attempt” is never ambiguous:
  • Which operation? Driven by whether content was already shown to the user.
    • Not shown → full retry (a fresh stream).
    • Shown → Continue (application-level continuation — never a full retry, which would force erasing or duplicating visible text). Continuation is exclusive to the post-content phase.
  • How many attempts? Driven by context (ceilings below). Always use full-jitter backoff; always stop early on retryable: false.
The flow:
  1. Stream fails before any visible content
    • Retry only when error.retryable is true, or the failure is a transport error that is plausibly transient.
    • Ceiling: 2 full-retry attempts for live UX.
  2. Stream fails after visible content (post-content recovery)
    • Preserve the partial output. Never silently replace the visible answer.
    • If a side-effecting tool call was emitted or executed this turn, skip the automatic attempt (see Tool Calls and Side Effects) and surface the interruption directly.
    • Otherwise attempt exactly one automatic Continue (continuation).
    • If the automatic Continue fails or was skipped, mark the answer interrupted and surface a user-initiated Continue (primary, non-destructive) and an explicitly labelled, destructive Try again (full retry, secondary).
    • “Exactly one” means a single automatic attempt, never a silent loop. User-initiated actions are user-governed and don’t count against the automatic budget.
    • For high-stakes, factual, or doctrinal answers, steer away from automatic Continue toward Try again or explicit user review (see Application-Level Continuation).
  3. Background/batch workflow
    • Up to 3 full-retry attempts with jittered exponential backoff and an overall job timeout. Partial output can be discarded safely because nothing was shown to a user.
  4. error.retryable is false
    • Do not retry unchanged. Surface the error, change the request or model, or escalate with the trace ID.
  5. finish_reason is "content_filter" (Signal B)
    • Terminal and non-retryable. Surface it; never retry or continue.
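The flow above can be condensed into one pure function, which is convenient for unit-testing the policy separately from networking. This is a sketch under stated assumptions: `FailureContext`, `next_action`, and the returned action strings are hypothetical names, and the ceilings are the defaults recommended in this guide.

```python
from dataclasses import dataclass

LIVE_FULL_RETRIES = 2        # live UX, nothing shown yet
BACKGROUND_FULL_RETRIES = 3  # background/batch jobs


@dataclass(frozen=True)
class FailureContext:
    """What the application knows at the moment a stream fails (hypothetical shape)."""
    retryable: bool                 # error.retryable, or True for a transport drop
    content_shown: bool             # was any text rendered to the user?
    content_filtered: bool = False  # finish_reason == "content_filter"
    background: bool = False
    full_retries_used: int = 0
    auto_continue_used: bool = False
    side_effecting_tool_call: bool = False
    high_stakes: bool = False


def next_action(ctx: FailureContext) -> str:
    """Pick the next recovery step for a failed stream."""
    if ctx.content_filtered:
        return "surface_content_filter"  # Signal B is always terminal
    if not ctx.retryable:
        return "surface_error"           # never retry unchanged
    if ctx.background:
        within = ctx.full_retries_used < BACKGROUND_FULL_RETRIES
        return "full_retry" if within else "surface_error"
    if not ctx.content_shown:
        within = ctx.full_retries_used < LIVE_FULL_RETRIES
        return "full_retry" if within else "surface_error"
    # Content is on screen: a full retry is never automatic from here on.
    if ctx.side_effecting_tool_call or ctx.high_stakes or ctx.auto_continue_used:
        return "mark_interrupted"        # show Continue / Try again controls
    return "auto_continue"
```

Keeping the operation choice (`content_shown`) and the attempt budget (`full_retries_used`, `auto_continue_used`) as separate fields mirrors the two axes described above.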

Application-Level Continuation

Completions V2 does not expose true stream resumption — there is no cursor, offset, replay token, or resume endpoint. What your application can do instead is application-level continuation: a new request that includes the partial output as context and asks the model to continue. Send the partial output inside a new user message:
{
  "messages": [
    {
      "role": "user",
      "content": "Explain Romans 8 in simple terms."
    },
    {
      "role": "user",
      "content": "The previous streamed answer was interrupted after this text:\n\nRomans 8 teaches that...\n\nContinue from that point without repeating the text above."
    }
  ],
  "auto_routing": true,
  "stream": true
}
Do not end the continuation request with an assistant message containing the partial text — final assistant turns are not supported uniformly across models and routing modes, while the user-message pattern works everywhere.
Continuation can drift. A continuation is a new request: it can repeat, drift, or complete differently from the failed stream. Because Continue appends with no visual seam, the consumer reads the result as one continuous answer. Two consequences for you as the developer:
  • High-stakes, factual, or doctrinal content: avoid automatic Continue. Prefer Try again (a clean replacement) or explicit user review, since drift can introduce unflagged errors mid-answer.
  • The prompt contract is best-effort. Instructing the model to “continue without repeating” usually works, but your application should still de-duplicate overlap at the seam.
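One way to de-duplicate at the seam is to drop the longest prefix of the continuation that exactly repeats the tail of the partial output. The sketch below is an assumption about how an application might do this, not a platform feature; `strip_seam_overlap` and its thresholds are hypothetical, and it only catches verbatim repetition, not paraphrase.

```python
def strip_seam_overlap(
    partial: str, continuation: str, min_overlap: int = 8, max_overlap: int = 200
) -> str:
    """Remove text at the start of `continuation` that repeats the end of `partial`.

    `min_overlap` guards against false positives: a single shared character or
    short word at the seam is usually coincidence, not repetition.
    """
    limit = min(len(partial), len(continuation), max_overlap)
    # Try the longest candidate first so the whole repeated run is removed.
    for size in range(limit, min_overlap - 1, -1):
        if partial.endswith(continuation[:size]):
            return continuation[size:]
    return continuation


partial = "Romans 8 teaches that"
continuation = "teaches that believers are free."
print(repr(strip_seam_overlap(partial, continuation)))  # ' believers are free.'
```

Because deltas arrive incrementally, an application would need to buffer roughly the first `max_overlap` characters of the continuation before rendering for this check to apply.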

Retry Budgets and Backoff

Recommended defaults:
| Context | Operation | Ceiling |
| --- | --- | --- |
| Live, no visible content yet | Full retry | 2 attempts |
| Live, visible partial content | Automatic Continue (continuation) | 1 attempt, then user-initiated controls |
| Background/batch | Full retry | 3 attempts |
  • Use full-jitter exponential backoff between attempts.
  • Cap live UX delays aggressively (a couple of seconds at most); use a broader cap for background jobs.
  • Stop early the moment an error reports retryable: false.
  • Do not blindly retry five times for live streams — generic rate-limit retry examples are not a live-streaming UX policy.
Both reference implementations use the same full-jitter formula:
import random

def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 2.0) -> float:
    """Full-jitter backoff: random() * min(cap, base * 2**attempt)."""
    return random.random() * min(cap, base * (2**attempt))
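To make the formula concrete, the sketch below computes only the deterministic upper bound of each delay, using the same defaults (base 0.5 s, cap 2.0 s). The helper name `backoff_ceiling` is an assumption for illustration; the actual delay is a uniform random value between 0 and this bound.

```python
def backoff_ceiling(attempt: int, base: float = 0.5, cap: float = 2.0) -> float:
    """Upper bound of the full-jitter delay for a zero-based attempt index."""
    return min(cap, base * (2 ** attempt))


# Live defaults: the bound doubles until it reaches the 2-second cap.
live_ceilings = [backoff_ceiling(attempt) for attempt in range(4)]
print(live_ceilings)  # [0.5, 1.0, 2.0, 2.0]
```

With two live full retries, the worst-case added wait is therefore 0.5 s + 1.0 s = 1.5 s, and the expected wait is half of that.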

Rate Limits

Rate limits are a separate concern from generic retryable stream failures:
  • An HTTP 429 before the stream starts is a normal rate-limit response. Respect Retry-After or X-RateLimit-Reset when the response actually includes those headers, and back off with jitter. See Rate Limits.
  • A rate limit that occurs after the stream has started (for example, an upstream provider limit) arrives as a Signal A stream error such as 2004 / RATE_LIMIT. Handle it through the normalized SSE error object — stream-time SSE error events do not carry Retry-After, X-RateLimit-Reset, or provider rate-limit headers.
  • Rate-limit retries should use jittered backoff and must not run indefinitely.
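For the pre-stream 429 case, a minimal sketch of reading `Retry-After` follows. HTTP allows that header to be either a number of seconds or an HTTP-date, so the helper handles both. The name `retry_after_seconds` and the 30-second cap are assumptions; `X-RateLimit-Reset` is left out because this guide does not specify its format.

```python
import email.utils
from datetime import datetime, timezone
from typing import Optional


def retry_after_seconds(
    header_value: Optional[str], now: Optional[datetime] = None, cap: float = 30.0
) -> Optional[float]:
    """Convert a Retry-After header into a bounded delay, or None if unusable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return min(float(value), cap)  # delay-seconds form
    try:
        when = email.utils.parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable: fall back to jittered backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    # Never wait a negative time, and never longer than the cap.
    return min(max((when - current).total_seconds(), 0.0), cap)
```

When the helper returns `None`, the caller can fall back to full-jitter backoff. Adding a small random offset to a server-provided delay helps keep many clients from retrying at the same instant.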

Partial Output UX

When recovery involves text a user has already seen:
  • Keep partial text visible. Never make displayed output disappear silently.
  • Mark the answer as interrupted if recovery fails, so the user knows it is incomplete.
  • Offer user-visible controls where appropriate: Continue (primary, appends) and Try again (secondary, explicitly labelled as replacing the current answer).
  • If full retry is chosen, make the replacement explicit — the user should understand the visible answer is being replaced, not extended.
  • Retain enough application state to know whether any content was displayed; that single flag drives the choice between full retry and continuation.

Tool Calls and Side Effects

Full retries and continuations can duplicate side effects if the model already emitted a tool call and your application executed it — sending an email twice, charging a card twice.
  • Use idempotency keys or operation IDs for tool execution, so a replayed call can be detected and dropped.
  • Record, per turn, whether a tool call was emitted and whether it was executed, before any retry decision.
  • Do not automatically replay side-effecting tool calls unless your application can prove the prior attempt did not execute.
  • If a side-effecting tool call fired in the interrupted turn, skip the automatic Continue and surface the interruption — prefer fail-fast or explicit user confirmation over silent recovery.
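The idempotency-key idea can be sketched with a small ledger that derives an operation ID from the turn, tool name, and arguments, and refuses to execute the same operation twice. Everything here is hypothetical (`ToolCallLedger`, `run_once`, the key derivation), and the in-memory dict stands in for whatever durable store an application would actually need.

```python
import hashlib
import json
from typing import Any, Callable


class ToolCallLedger:
    """Per-process record of executed tool calls (in-memory, for illustration only)."""

    def __init__(self) -> None:
        self._results: dict = {}

    @staticmethod
    def operation_id(turn_id: str, name: str, arguments: dict) -> str:
        # Canonical JSON so that argument order does not change the key.
        canonical = json.dumps(
            {"turn": turn_id, "tool": name, "args": arguments}, sort_keys=True
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def was_executed(self, turn_id: str, name: str, arguments: dict) -> bool:
        return self.operation_id(turn_id, name, arguments) in self._results

    def run_once(
        self, turn_id: str, name: str, arguments: dict, execute: Callable[..., Any]
    ) -> Any:
        op_id = self.operation_id(turn_id, name, arguments)
        if op_id in self._results:
            return self._results[op_id]  # replayed call: reuse the recorded result
        result = execute(**arguments)
        self._results[op_id] = result
        return result


sent = []
ledger = ToolCallLedger()
args = {"to": "a@example.com", "subject": "Hi"}
for _ in range(2):  # the second call simulates a replay after a retry
    ledger.run_once("turn-1", "send_email", args, lambda **kw: sent.append(kw) or "sent")
print(len(sent))  # 1
```

Note that this sketch records the result only after execution finishes, so a crash in between would leave the call unrecorded. A production version would write a "started" marker to durable storage first and treat an unfinished marker as "may have executed".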

Observability Checklist

Log the full streaming error object, including the non-branching fields error.type, error.category, and error.description, even though retry decisions only use retryable, fault, code, and name.

Trace IDs. Log one normalized field named trace_id, with this precedence:
event.error?.trace_id ?? event.trace_id ?? response.headers.get("x-sentry-trace-id")
If event.error.trace_id and the top-level event.trace_id both exist and differ, log both, and include both in support requests.

Recommended fields per failed or recovered stream:
  • normalized trace_id (plus the secondary trace ID on mismatch)
  • error.code, error.name, error.type, error.fault, error.retryable
  • selected model (or requested model) and routing mode
  • whether content was displayed, and the partial output length
  • attempt number and backoff delay
  • whether continuation or full retry was used
  • whether tool calls were emitted or executed
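A minimal sketch of assembling that record follows. The function name `build_stream_log` and the flat key names are assumptions; the trace-ID precedence and the error fields are the ones described above. Application-side context (attempt number, backoff delay, and so on) is passed in as keyword arguments.

```python
from typing import Any, Optional


def build_stream_log(
    event: dict, header_trace_id: Optional[str] = None, **context: Any
) -> dict:
    """Flatten one failed SSE chunk plus application context into a log record."""
    error = event.get("error") or {}
    error_trace = error.get("trace_id")
    top_trace = event.get("trace_id")
    record = {
        # Precedence: error-level, then top-level, then response header.
        "trace_id": error_trace or top_trace or header_trace_id,
        "model": event.get("model"),
    }
    if error_trace and top_trace and error_trace != top_trace:
        record["secondary_trace_id"] = top_trace  # log both on mismatch
    # Keep the non-branching fields too; they help support and dashboards.
    for field in ("code", "name", "type", "category", "description", "fault", "retryable"):
        record["error_" + field] = error.get(field)
    record.update(context)  # e.g. attempt, backoff_s, content_displayed
    return record
```

A call might look like `build_stream_log(event, header_trace_id, attempt=1, backoff_s=0.4, content_displayed=True, partial_length=212, operation="continuation", tool_call_emitted=False)`, with the result handed to whatever structured logger the application already uses.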

Summary Table

| Condition | Recommended action | Retry ceiling | UX note |
| --- | --- | --- | --- |
| Pre-stream retryable error (408, 429, 5xx) | Full retry with backoff | 2 (live) | Nothing rendered yet; retry is invisible |
| Stream error before first token | Full retry if retryable: true | 2 full retries | Show a loading state, not an error, until the budget is spent |
| Stream error after partial output | One automatic Continue, then user controls | 1 automatic continuation | Preserve partial text; mark interrupted on failure |
| Rate limit (HTTP 429 pre-stream) | Back off; honor Retry-After/X-RateLimit-Reset if present | Bounded, jittered | Consider a “busy, retrying” indicator |
| Rate limit (stream event 2004) | Treat as Signal A via error object | Same as stream phase | No retry headers exist on SSE errors |
| Content filter (finish_reason: "content_filter") | Surface; never retry or continue | 0 | Terminal signal, not an error object |
| Non-retryable error (retryable: false) | Stop; surface, change request/model, or escalate | 0 | Include trace_id in support requests |
| Side-effecting tool workflow | Skip automatic recovery; fail fast or confirm | 0 automatic | Require explicit user action |
| Background/batch workflow | Full retry with backoff and job timeout | 3 full retries | Partial output can be discarded safely |

Reference Implementations

The modules below implement the full decision tree: SSE parsing, the two signals, trace-ID normalization, retry ceilings, full-jitter backoff, the one-automatic-Continue rule, and the tool-call side-effect guard. Copy either one into your project as a small client module. Usage:
result = stream_with_recovery(
    "https://platform.ai.gloo.com/ai/v2/chat/completions",
    {"Authorization": f"Bearer {access_token}"},
    {
        "messages": [{"role": "user", "content": "Explain Romans 8 in simple terms."}],
        "auto_routing": True,
        "stream": True,
    },
    on_delta=lambda text: print(text, end="", flush=True),
)
# result.status is "complete", "content_filter", "interrupted", or "failed"
# result.text always contains everything safe to display
Recovery:
# Standard library plus `requests`:

"""Streaming recovery client for Gloo AI Completions V2."""

import json
import random
import time
from dataclasses import dataclass

import requests

LIVE_PRE_TOKEN_RETRIES = 2     # full-retry attempts before any content is visible
BACKGROUND_RETRIES = 3         # full-retry attempts for background/batch jobs
LIVE_BACKOFF_CAP_S = 2.0       # keep live UX delays short
BACKGROUND_BACKOFF_CAP_S = 30.0


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = LIVE_BACKOFF_CAP_S) -> float:
    """Full-jitter backoff: random() * min(cap, base * 2**attempt)."""
    return random.random() * min(cap, base * (2**attempt))


@dataclass
class StreamOutcome:
    """Everything one stream attempt tells you, including observability fields."""

    partial_text: str = ""
    received_content: bool = False
    tool_call_emitted: bool = False
    finish_reason: str | None = None
    error: dict | None = None
    trace_id: str | None = None
    secondary_trace_id: str | None = None  # set when error-level and top-level IDs differ
    transport_error: str | None = None
    http_status: int | None = None

    @property
    def completed(self) -> bool:
        return self.finish_reason in ("stop", "length", "tool_calls")

    @property
    def content_filtered(self) -> bool:
        return self.finish_reason == "content_filter"

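    @property
    def retryable(self) -> bool:
        if self.http_status is not None and self.http_status != 200:
            # Pre-stream HTTP error: only 408, 429, and 5xx are worth retrying.
            return self.http_status in (408, 429) or self.http_status >= 500
        if self.transport_error is not None:
            return True  # transport drops are plausibly transient
        return bool(self.error and self.error.get("retryable"))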


@dataclass
class RecoveryResult:
    # "complete" | "content_filter" | "interrupted" | "failed"
    status: str
    text: str
    last_outcome: StreamOutcome
    used_continuation: bool = False
    full_retries_used: int = 0


def stream_once(url: str, headers: dict, body: dict, on_delta=None) -> StreamOutcome:
    """Run one streaming request and fold every SSE event into a StreamOutcome."""
    outcome = StreamOutcome()
    try:
        with requests.post(url, headers=headers, json=body, stream=True, timeout=120) as response:
            outcome.http_status = response.status_code
            if response.status_code != 200:
                # Pre-stream HTTP error: no SSE events will follow.
                outcome.transport_error = f"HTTP {response.status_code}"
                return outcome
            # Lowest-precedence trace source; SSE events override it below.
            outcome.trace_id = response.headers.get("x-sentry-trace-id")
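            # SSE is UTF-8 by specification; without an explicit charset,
            # requests would decode text/* responses as ISO-8859-1.
            response.encoding = "utf-8"
            for line in response.iter_lines(decode_unicode=True):
                if not line or not line.startswith("data:"):
                    continue
                payload = line[len("data:"):].strip()
                if payload == "[DONE]":
                    break
                _apply_event(outcome, json.loads(payload), on_delta)
                if outcome.finish_reason is not None:
                    break
            if outcome.finish_reason is None:
                # Stream ended with no terminal event: treat it as a transport
                # disconnect so the decision tree can retry or continue.
                outcome.transport_error = "stream ended without finish_reason"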
    except (requests.RequestException, json.JSONDecodeError) as exc:
        # A truncated or malformed chunk is handled like a transport drop.
        outcome.transport_error = str(exc)
    return outcome


def _apply_event(outcome: StreamOutcome, event: dict, on_delta=None) -> None:
    # trace_id precedence: error-level, then top-level, then response header.
    error_trace = (event.get("error") or {}).get("trace_id")
    top_trace = event.get("trace_id")
    outcome.trace_id = error_trace or top_trace or outcome.trace_id
    if error_trace and top_trace and error_trace != top_trace:
        outcome.secondary_trace_id = top_trace  # log both, send both to support

    choice = (event.get("choices") or [{}])[0]
    delta = choice.get("delta") or {}
    if delta.get("content"):
        outcome.partial_text += delta["content"]
        outcome.received_content = True
        if on_delta:
            on_delta(delta["content"])
    if delta.get("tool_calls") or delta.get("function_call"):
        outcome.tool_call_emitted = True
    if choice.get("finish_reason"):
        outcome.finish_reason = choice["finish_reason"]
    if event.get("error"):
        outcome.error = event["error"]


def continuation_messages(messages: list[dict], partial_text: str) -> list[dict]:
    """Provider-neutral continuation: partial output goes in a NEW user message."""
    return [
        *messages,
        {
            "role": "user",
            "content": (
                "The previous streamed answer was interrupted after this text:\n\n"
                f"{partial_text}\n\n"
                "Continue from that point without repeating the text above."
            ),
        },
    ]


def stream_with_recovery(
    url: str, headers: dict, body: dict, *, background: bool = False, on_delta=None
) -> RecoveryResult:
    """One streamed completion with bounded recovery.

    Live mode: 2 full retries before the first token; after visible content,
    exactly one automatic Continue, then surface user controls.
    Background mode: up to 3 full retries; partial output is discarded safely
    because nothing was shown to a user.
    """
    max_full_retries = BACKGROUND_RETRIES if background else LIVE_PRE_TOKEN_RETRIES
    backoff_cap = BACKGROUND_BACKOFF_CAP_S if background else LIVE_BACKOFF_CAP_S

    full_retries = 0
    while True:
        outcome = stream_once(url, headers, body, on_delta)

        if outcome.completed:
            return RecoveryResult("complete", outcome.partial_text, outcome,
                                  full_retries_used=full_retries)
        if outcome.content_filtered:
            # Terminal non-error signal: surface it; never retry or continue.
            return RecoveryResult("content_filter", outcome.partial_text, outcome,
                                  full_retries_used=full_retries)
        if not outcome.retryable:
            # retryable=false: do not retry unchanged.
            return RecoveryResult("failed", outcome.partial_text, outcome,
                                  full_retries_used=full_retries)

        if outcome.received_content and not background:
            # Content is on screen: a full retry would erase or duplicate it.
            return _recover_after_partial(url, headers, body, outcome, on_delta)

        if full_retries >= max_full_retries:
            return RecoveryResult("failed", outcome.partial_text, outcome,
                                  full_retries_used=full_retries)
        time.sleep(full_jitter_delay(full_retries, cap=backoff_cap))
        full_retries += 1


def _recover_after_partial(
    url: str, headers: dict, body: dict, first: StreamOutcome, on_delta=None
) -> RecoveryResult:
    """Post-content recovery: keep the partial text, try one automatic Continue."""
    if first.tool_call_emitted:
        # Side-effect guard: an automatic attempt could replay the tool call.
        # Mark the answer interrupted and hand control to the user.
        return RecoveryResult("interrupted", first.partial_text, first)

    time.sleep(full_jitter_delay(0))
    continuation_body = {**body, "messages": continuation_messages(body["messages"], first.partial_text)}
    second = stream_once(url, headers, continuation_body, on_delta)
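    if second.completed:
        return RecoveryResult("complete", first.partial_text + second.partial_text,
                              second, used_continuation=True)
    if second.content_filtered:
        # Signal B is terminal even mid-continuation: report it as such.
        return RecoveryResult("content_filter", first.partial_text + second.partial_text,
                              second, used_continuation=True)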

    # The one automatic Continue failed. Keep everything visible, mark the
    # answer interrupted, and surface user-initiated Continue / Try again.
    return RecoveryResult("interrupted", first.partial_text + second.partial_text,
                          second, used_continuation=True)

Errors Reference

Error codes, the AI error object, and streaming error delivery.

Rate Limits

HTTP 429 handling and backoff guidance.

Completions V2 Guide

Routing modes, streaming support, and request parameters.

Tool Use Guide

Function calling across routing modes, including streaming.