Overview

The Gloo AI completions API supports streaming responses, so instead of waiting for the full answer, your application receives tokens one at a time as the model generates them. This creates a faster, more interactive user experience and is the standard pattern for chat and content-generation products. In this tutorial you’ll build a streaming client from scratch: parsing the SSE wire protocol, accumulating tokens, handling errors, and rendering output as it arrives. You’ll also build a server-side proxy that shields your API credentials from the browser.

What You’ll Build

By the end of this tutorial, you’ll have a complete streaming implementation featuring:
  • SSE stream parser that reads tokens as they arrive from the API
  • Token accumulator that assembles the full response with timing and token count
  • Streaming-aware error handler that catches auth and rate-limit errors before reading the stream
  • Terminal renderer that displays tokens in real time with a typing effect
  • Server-side proxy that relays the stream to browser clients without exposing your credentials

Understanding Server-Sent Events

When you set "stream": true in a completions request, the API switches from a single JSON response to an SSE stream. Each token arrives as a line formatted data: <json>, with blank lines separating events:
data: {"choices":[{"delta":{"content":"The"},"finish_reason":null}]}

data: {"choices":[{"delta":{"content":" resurrection"},"finish_reason":null}]}

data: {"choices":[{"delta":{"content":" is"},"finish_reason":null}]}

data: {"choices":[{"delta":{"content":"..."},"finish_reason":"stop"}]}
The stream ends when a chunk arrives with a non-null finish_reason (typically "stop").
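As a minimal sketch of what the parser you'll build in Step 3 does, a single data line can be decoded like this (the line content here is an example payload, not captured output):

```python
import json

# One SSE event line as it arrives on the wire (example payload)
line = 'data: {"choices":[{"delta":{"content":"The"},"finish_reason":null}]}'

# Strip the "data: " prefix, then parse the JSON payload
chunk = json.loads(line[len("data: "):])
token = chunk["choices"][0]["delta"]["content"]
print(token)  # -> The
```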

Two Approaches: Direct vs. Proxy

This tutorial covers two ways to consume the stream:
| Approach | How it works | When to use |
| --- | --- | --- |
| Terminal | Your server calls the API directly and prints tokens | Background jobs, CLIs, server-side rendering |
| Proxy | A lightweight server relays SSE to any external client | Web apps, any case where browser JS would expose credentials |

Prerequisites

Before starting, ensure you have your Gloo AI credentials (GLOO_CLIENT_ID and GLOO_CLIENT_SECRET) ready.
The starter project includes a pre-built auth module. You don’t need to implement authentication in this tutorial — it’s already working in the starter code.

Getting Started with the Starter Project

This tutorial uses a hands-on approach where you’ll build the streaming client incrementally. The starter code provides complete scaffolding with TODO markers guiding each step.

Download the Starter Code

Choose your preferred language and download the starter project:

  • Python — Python 3.9+ · requests · Flask
  • JavaScript — Node.js 18+ · native fetch · Express
  • TypeScript — TypeScript 5+ · typed SSE chunks
  • PHP — PHP 8.1+ · cURL write callback
  • Go — Go 1.20+ · bufio.Scanner · http.Flusher
  • Java — Java 17+ · HttpClient · Maven

Quick Setup

cd starter/python
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your GLOO_CLIENT_ID and GLOO_CLIENT_SECRET

Test Your Setup

Run the entry point — it should load your credentials and confirm the stubs are in place:
python main.py
You should see credentials load successfully, followed by NotImplementedError (or equivalent) from the first stub — confirming that setup is complete and you’re ready to implement.

Architecture Overview

Component Architecture

Implementation Roadmap

| Step | What You Build | Track | Validates |
| --- | --- | --- | --- |
| 1 | Environment setup | Shared | Auth loads; streaming endpoint reachable |
| 2 | Handle stream errors | Shared | 401/403/429 errors thrown before stream read |
| 3 | Streaming request + SSE parsing | Shared | HTTP connection opens; SSE lines parsed; [DONE] detected |
| 4 | Token extraction + accumulation | Shared | Token text extracted; full response assembled with timing |
| 5 | Render stream to terminal | Terminal | Tokens print live to terminal |
| 6 | Proxy stream handler | Proxy | SSE relayed through server |
| 7 † | Testing & browser demo | Proxy | End-to-end validation |
† No new implementation — run the demo, test the proxy via API, and explore the browser client.
Steps 1–5 build the streaming client. Step 6 adds the server-side proxy. Step 7 walks through the browser demo. Let’s get started!

Step 1: Environment Setup & Auth Verification

The starter project includes a pre-built auth module that handles OAuth2 client credentials. Before implementing any streaming logic, confirm it works with the streaming endpoint.

What You’ll Verify

  1. Credentials load correctly from .env
  2. A token can be obtained from the Gloo AI auth server
  3. A request to the completions endpoint returns 200 OK with Content-Type: text/event-stream

Testing Your Setup

Run the Step 1 checkpoint now — it should pass with the pre-built auth:
python tests/step1_auth_test.py

✓ Checkpoint: Auth Verification

Your output should look similar to the following:
🧪 Testing: Environment Setup & Auth Verification

✓ GLOO_CLIENT_ID loaded
✓ GLOO_CLIENT_SECRET loaded

Test 1: Obtaining access token...
✓ Access token obtained
  Expires in: 3600 seconds

Test 2: Token caching (ensure_valid_token)...
✓ Token cached correctly — same token returned on consecutive calls

Test 3: Verifying streaming endpoint...
✓ Status: 200 OK
✓ Content-Type: text/event-stream; charset=utf-8

✅ Auth and streaming endpoint verified.
   Next: Making the Streaming Request
If tests fail, check:
  • .env file exists in the language directory (not just .env.example)
  • GLOO_CLIENT_ID and GLOO_CLIENT_SECRET are set correctly
  • You’ve completed the Authentication Tutorial prerequisites

Step 2: Streaming-Aware Error Handling

Now implement the stream error handler, a focused function that maps HTTP status codes to descriptive exceptions before any stream data is read.

Key Concepts

Two-Phase Error Handling

Streaming introduces two distinct error phases:

Phase 1 — Pre-stream (before reading bytes): The HTTP status tells you everything. A 401 means bad token; a 429 means slow down. Check the status immediately and throw a specific error before touching the body. This is what the stream error handler does.

Phase 2 — Mid-stream (while reading bytes): The connection is live when something fails — network drop, server restart, timeout. Catch these in the accumulation loop with a try/catch around the read loop. If you’ve already accumulated partial text, preserve it and return what you have rather than discarding the work.

Separating these phases makes errors debuggable: pre-stream errors have status codes; mid-stream errors have partial content.

Implementation Guide

Open your streaming client file and find the error handler method; it’s a small, focused function with one case per status code. Review the TODO comments, then implement the function:
# File: streaming/stream_client.py
def handle_stream_error(status_code: int, response_body: str = "") -> None:
    if status_code == 401:
        raise Exception("Authentication failed (401): Invalid or expired token")
    elif status_code == 403:
        raise Exception("Authorization failed (403): Insufficient permissions")
    elif status_code == 429:
        raise Exception("Rate limit exceeded (429): Too many requests")
    elif status_code != 200:
        raise Exception(f"API error ({status_code}): {response_body[:200]}")
The code does the following:
  • Throws an authentication error on 401 if the token is missing, expired, or malformed
  • Throws an authorization error on 403 if the token is valid but lacks permission for this resource
  • Throws a rate limit error on 429 if the request was rejected before the API spent any compute
  • Throws a generic error for any other non-200 status, including the response body for diagnostic context
  • Returns without throwing on 200 so the caller can proceed to read the stream
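The pre-stream contract can be exercised in isolation. The function body below is copied from the step above so the snippet runs standalone; in the starter project you would call the method on your client instead:

```python
def handle_stream_error(status_code: int, response_body: str = "") -> None:
    # Same body as the implementation above (repeated here for a runnable snippet)
    if status_code == 401:
        raise Exception("Authentication failed (401): Invalid or expired token")
    elif status_code == 403:
        raise Exception("Authorization failed (403): Insufficient permissions")
    elif status_code == 429:
        raise Exception("Rate limit exceeded (429): Too many requests")
    elif status_code != 200:
        raise Exception(f"API error ({status_code}): {response_body[:200]}")

# 200 passes silently; anything else raises before the stream is ever read
handle_stream_error(200)
try:
    handle_stream_error(429)
except Exception as e:
    print(e)  # -> Rate limit exceeded (429): Too many requests
```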

✓ Checkpoint: Error Handling

Run the error handling test:
python tests/step2_error_handling_test.py
Your output should look similar to the following:
🧪 Testing: Streaming Error Handling

Test 1: handle_stream_error(401)...
✓ 401 raises: Authentication failed (401): Invalid or expired token
Test 2: handle_stream_error(403)...
✓ 403 raises: Authorization failed (403): Insufficient permissions
Test 3: handle_stream_error(429)...
✓ 429 raises: Rate limit exceeded (429): Too many requests
Test 4: handle_stream_error(200) — success, no exception...
✓ 200 OK — no exception raised
Test 5: handle_stream_error(500)...
✓ 500 throws with body: API error (500): Internal Server Error

✅ Two-phase error handling working.
   Next: Streaming Requests & SSE Parsing
If tests fail, check:
  • Status 200 must not raise an exception
  • The error message for non-200 includes the status code
  • The response body is truncated (first 200 chars) to avoid enormous error messages

Step 3: Streaming Requests & SSE Parsing

Time to wire up the streaming connection. You’ll open a persistent HTTP connection to the completions API and write the parser that converts raw SSE lines into something you can actually work with.

What You’ll Implement

  1. A function to initiate a streaming request
  2. A function to parse individual SSE lines

Making the Streaming Request

Why stream: true Changes Everything

Without stream: true, the API buffers the entire response and returns it as a single JSON object. With stream: true, it switches to SSE mode: the connection stays open and bytes arrive incrementally as the model generates them. This is why you return the raw response object rather than parsed JSON — the body isn’t fully available yet. The caller will read it line by line in the next steps.

Fail Fast Before Reading

Checking the HTTP status code before starting to read the stream is important for a clean user experience. A 401 response will never produce SSE data — it returns a JSON error body. If you skipped the status check and tried to parse lines from a 401 response, you’d get confusing parse errors instead of a clear “authentication failed” message.

Implementation Guide

Still in the same streaming client file, find the streaming request method, review the TODO comments, then implement the changes outlined in the code block:
# File: streaming/stream_client.py
def make_streaming_request(message: str, token: str):
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    payload = {
        "messages": [{"role": "user", "content": message}],
        "auto_routing": True,
        "stream": True,
    }
    response = requests.post(API_URL, headers=headers, json=payload, stream=True)
    handle_stream_error(
        response.status_code,
        response.text if response.status_code != 200 else "",
    )
    return response
The code does the following:
  • Sets Authorization and Content-Type headers using the provided token
  • Builds the request payload with stream: true to enable SSE mode and auto_routing: true to let Gloo select the best model
  • Checks the HTTP status before reading any response data, raising a descriptive error for non-200 responses
  • Returns the raw response object so the caller can iterate its body line by line
PHP note: cURL’s streaming architecture doesn’t allow inspecting the HTTP status before the write callback fires. The status check happens on the first data chunk instead. This is the idiomatic PHP pattern for streaming with cURL.

Parsing SSE Lines

The SSE Wire Format

SSE is a simple text protocol. Each event is one line starting with data: , terminated by a blank line. In practice, the Gloo AI stream looks like:
data: {"id":"...","choices":[{"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"...","choices":[{"delta":{"content":" world"},"finish_reason":"stop"}]}
Blank lines are separators, not errors — they’re common and should be silently skipped. Lines that don’t start with data: (such as event: or : comment lines) should also be skipped.

Defensive Parsing

The JSON parse is wrapped in a try/catch. Mid-stream network hiccups can produce partial lines — you don’t want a single malformed chunk to crash the entire stream. Return null for unparseable lines and let the accumulation loop move on.

Implementation Guide

You’re still working with the streaming client file. Find the SSE line parser method, review the TODO comments, then implement:
# File: streaming/stream_client.py
def parse_sse_line(line: str):
    if not line or not line.strip():
        return None
    if not line.startswith("data: "):
        return None
    data = line[6:]  # strip 'data: ' prefix
    if data.strip() == "[DONE]":
        return "[DONE]"
    try:
        return json.loads(data)
    except json.JSONDecodeError:
        return None
The code does the following:
  • Returns null for blank lines and lines that don’t start with data: , signalling the caller to skip to the next line
  • Strips the data: prefix to isolate the raw JSON payload
  • Detects the [DONE] sentinel before attempting JSON parsing and returns it as a string to signal the end of the stream
  • Parses the payload as JSON and returns the result, or null if parsing fails — never throws on malformed input

✓ Checkpoint: Streaming Request & SSE Parsing

Run the validation test for this step:
python tests/step3_sse_parsing_test.py
Your output should look similar to the following:
🧪 Testing: Streaming Request & SSE Line Parsing

✓ Token obtained

Test 1: parse_sse_line — blank line...
✓ Blank line → None
Test 2: parse_sse_line — non-data line...
✓ Non-data line → None
Test 3: parse_sse_line — [DONE] sentinel...
✓ data: [DONE] → '[DONE]'
Test 4: parse_sse_line — valid JSON data line...
✓ data: {json} → parsed dict
Test 5: parse_sse_line — malformed JSON...
✓ Malformed JSON → None (gracefully handled)

Test 6: make_streaming_request() — live connection...
✓ Streaming connection opened (status 200)
Test 7: Iterating SSE lines and detecting stream termination...
✓ Processed 5 lines, 2 data chunks
✓ Stream terminated cleanly (finish_reason=stop)

Test 8: Bad credentials → authentication error before reading stream...
✓ Bad credentials caught (pre-stream): Authorization failed (403): Insufficient permissions

✅ Streaming request and SSE parsing working.
   Next: Token Extraction & Accumulation
If tests fail, check:
  • The streaming request function sets stream to true in the payload
  • The SSE line parser strips exactly 6 characters ("data: " has a space after the colon)
  • The [DONE] check happens before the JSON parse

Step 4: Token Extraction & Accumulation

Next you’ll add the pieces to pull the token out of each parsed SSE chunk, and the accumulation loop that stitches everything together into a complete result.

What You’ll Implement

  1. A function to extract token content from a parsed SSE chunk
  2. A function to collect the full stream into a result object

Extracting Token Content

Why Content Can Be Absent

Not every SSE chunk carries text. The first chunk establishes the role (delta: {"role": "assistant"}), while the final chunk carries the finish reason with an empty or absent delta. Only chunks in the middle carry actual content. This is why you return an empty string rather than throwing: an absent content field is completely normal. The accumulation loop skips empty strings when counting tokens.

Null-Safe Navigation

Different languages handle missing keys differently. In Python, .get() returns None without raising; in JavaScript/TypeScript, optional chaining (?.) does the same. In Go and Java the struct is fully typed, so missing content simply maps to the zero value. The goal in all languages is the same: never throw when a field is absent.

Implementation Guide

Still in the streaming client file, find the token content extractor, review the TODO comments, then implement:
# File: streaming/stream_client.py
def extract_token_content(chunk: dict) -> str:
    try:
        choices = chunk.get("choices", [])
        if not choices:
            return ""
        delta = choices[0].get("delta", {})
        return delta.get("content") or ""
    except (IndexError, AttributeError, KeyError):
        return ""
The code does the following:
  • Returns an empty string immediately if choices is absent or empty — the first and last chunks often carry no content
  • Reads delta.content from the first choice, returning an empty string if the field is absent or null
  • Handles any unexpected chunk structure by returning an empty string rather than throwing, keeping the accumulation loop running cleanly
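To see the null-safe behavior concretely, here is the extractor from this step (body copied from above so the snippet runs standalone) exercised against the three chunk shapes described earlier:

```python
def extract_token_content(chunk: dict) -> str:
    # Same body as the implementation above (repeated here for a runnable snippet)
    try:
        choices = chunk.get("choices", [])
        if not choices:
            return ""
        delta = choices[0].get("delta", {})
        return delta.get("content") or ""
    except (IndexError, AttributeError, KeyError):
        return ""

# Role-only first chunk, a content chunk, and a chunk with no choices
print(repr(extract_token_content({"choices": [{"delta": {"role": "assistant"}}]})))  # -> ''
print(repr(extract_token_content({"choices": [{"delta": {"content": "Hi"}}]})))      # -> 'Hi'
print(repr(extract_token_content({"choices": []})))                                  # -> ''
```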

Accumulating the Full Response

Two Ways to Consume a Stream

You can either accumulate all tokens into a string (what the function in this step does) or print each token immediately as it arrives (what the function in Step 5 does). The choice depends on whether you need the full text before taking action:
  • Accumulate: useful when you need to parse the full response, log it, or return it from an API
  • Print immediately: useful for CLI tools and browser UIs where you want the typing effect

The Line Buffer (JS/TS/PHP)

In Python and Go, the HTTP libraries provide line-at-a-time iteration. In JavaScript, TypeScript, and PHP, you read raw bytes and split on \n yourself. This requires a line buffer: keep any incomplete final chunk in a variable and prepend it to the next read’s output. Without it, tokens near chunk boundaries get split across two parse calls.
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop() ?? ""; // save incomplete last line
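The same buffering technique, sketched here in Python over hypothetical raw byte chunks (Python's iter_lines normally does this for you, so this is purely illustrative), behaves like:

```python
def split_lines(chunks):
    """Yield complete lines from an iterable of byte chunks,
    buffering any incomplete trailing line between reads."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk.decode("utf-8")
        lines = buffer.split("\n")
        buffer = lines.pop()  # save incomplete last line for the next chunk
        for line in lines:
            yield line
    if buffer:
        yield buffer  # flush whatever remains at end of stream

# A data line split across two network reads still parses as one line
chunks = [b'data: {"a"', b': 1}\n\n']
print(list(split_lines(chunks)))  # -> ['data: {"a": 1}', '']
```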

Implementation Guide

Open the streaming client file and find the accumulation loop method. This one brings together everything from the previous steps, with the TODO comments showing each stage. Take a moment to trace through the structure before implementing.
# File: streaming/stream_client.py
def stream_completion(message: str, token: str) -> dict:
    start_time = time.time()
    response = make_streaming_request(message, token)

    full_text = ""
    token_count = 0
    finish_reason = "unknown"

    try:
        for raw_line in response.iter_lines(decode_unicode=True):
            chunk = parse_sse_line(raw_line)
            if chunk is None:
                continue
            if chunk == "[DONE]":
                break
            content = extract_token_content(chunk)
            if content:
                full_text += content
                token_count += 1
            choices = chunk.get("choices", [])
            if choices and choices[0].get("finish_reason"):
                finish_reason = choices[0]["finish_reason"]
    except Exception:
        if full_text:
            pass  # preserve partial output on error
        else:
            raise

    duration_ms = int((time.time() - start_time) * 1000)
    return {
        "text": full_text,
        "token_count": token_count,
        "duration_ms": duration_ms,
        "finish_reason": finish_reason,
    }
The code does the following:
  • Records the start time before opening the stream so elapsed duration includes connection overhead
  • Initializes accumulators for the full text, token count, and finish reason
  • Iterates the stream line by line, parsing each with the SSE parser and skipping null lines
  • Records the finish reason whenever a chunk carries a non-null finish_reason, and stops the loop when the [DONE] sentinel arrives
  • Returns a single result object containing the assembled text, token count, elapsed duration in milliseconds, and finish reason

✓ Checkpoint: Token Extraction & Accumulation

Run the validation test for this step:
python tests/step4_accumulation_test.py
Your output should look similar to the following:
🧪 Testing: Token Extraction & Accumulation

Test 1: extract_token_content — normal chunk...
✓ Normal chunk → 'Hello'
Test 2: extract_token_content — null content delta...
✓ Null content → ''
Test 3: extract_token_content — empty delta (role-only chunk)...
✓ Empty delta → ''
Test 4: extract_token_content — no choices...
✓ Empty choices → ''
Test 5: extract_token_content — finish_reason chunk...
✓ finish_reason chunk → '' (no content tokens from finish chunk)

Test 6: stream_completion — full response assembly...
✓ Delta content extraction working
✓ Null delta handled gracefully
✓ finish_reason detected: stop
✓ Duration tracked: 2098ms
✓ Token count: 2 tokens
  Response preview: '1 2 3 4 5'

✅ Full response assembled.
   Next: Typing-Effect Renderer
If tests fail, check:
  • The token content extractor returns "" (not None/null) when content is absent
  • The accumulation loop reads finish_reason from choices[0], not from the top-level chunk
  • The line buffer (buffer = lines.pop()) is in place for JS/TS/PHP

Step 5: Typing-Effect Terminal Renderer

Now implement the terminal renderer, a function that prints each token immediately to stdout without a newline, creating a live typing effect in the terminal. This step demonstrates an important pattern: consuming the stream directly rather than accumulating it first. The renderer calls the streaming request, SSE parsing, and token extraction functions, but skips the accumulation loop entirely.

Key Concepts

Unbuffered Output

By default, most languages buffer stdout, which means output is held until the buffer fills or the program exits. For a typing effect you need every token to appear immediately. Each language has its own way to force this:
| Language | Unbuffered write |
| --- | --- |
| Python | print(content, end="", flush=True) |
| JavaScript / TypeScript | process.stdout.write(content) |
| PHP | echo $content; ob_flush(); flush(); |
| Go | fmt.Fprint(os.Stdout, content) (stdout is unbuffered by default) |
| Java | System.out.print(content); System.out.flush(); |
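A tiny, self-contained demo of the Python variant (simulated tokens standing in for SSE deltas — no API call) makes the effect visible in any terminal:

```python
import time

# Simulated tokens standing in for SSE deltas (illustrative, not real model output)
tokens = ["Streaming ", "makes ", "output ", "feel ", "alive."]

for token in tokens:
    print(token, end="", flush=True)  # no newline; flush forces immediate display
    time.sleep(0.2)                   # exaggerate the arrival gap for effect
print()
```

Run without the flush=True and the text appears all at once when the program exits — exactly the buffering problem described above.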

Direct Stream Consumption vs. Accumulation

The stream completion function from Step 4 accumulates everything and returns once the stream is complete. The terminal renderer function prints as it goes: the user sees output before the model has finished generating. Both patterns are valid; the right choice depends on whether the output needs to be complete before it’s useful.

Implementation Guide

Open the renderer file referenced in the code block. Unlike the streaming client, this file has a single method to implement. Review the TODO comments, then implement:
# File: browser/renderer.py
def render_stream_to_terminal(message: str, token: str) -> None:
    print(f"Prompt: {message}\n")
    print("Response: ", end="", flush=True)

    response = make_streaming_request(message, token)

    total_tokens = 0
    finish_reason = "unknown"

    for raw_line in response.iter_lines(decode_unicode=True):
        chunk = parse_sse_line(raw_line)
        if chunk is None:
            continue
        if chunk == "[DONE]":
            break
        content = extract_token_content(chunk)
        if content:
            print(content, end="", flush=True)
            total_tokens += 1
        choices = chunk.get("choices", [])
        if choices and choices[0].get("finish_reason"):
            finish_reason = choices[0]["finish_reason"]

    print()
    print(f"\n[{total_tokens} tokens, finish_reason={finish_reason}]")
The code does the following:
  • Prints the user’s message as a prompt header before the response begins
  • Opens the stream and iterates SSE lines directly, without an accumulation loop, so tokens are available to print as soon as they arrive
  • Writes each token to stdout without a trailing newline and flushes immediately, producing a character-by-character typing effect
  • Prints a summary line with the total token count and finish reason after the stream ends

✓ Checkpoint: Terminal Renderer

Run the validation test:
python tests/step5_renderer_test.py
Your output should look similar to the following:
🧪 Testing: Typing-Effect Renderer

✓ Token obtained

Test 1: render_stream_to_terminal() — streaming to terminal...
Prompt: Reply with exactly: Hello streaming world

Response: Hello streaming world

[2 tokens, finish_reason=stop]
✓ Prompt header printed
✓ Response label printed
✓ Token summary found: 2 tokens, finish_reason=stop

✅ Typing-effect renderer working.
   Next: Server-Side Proxy
With a short prompt like this, tokens arrive so quickly that the typing effect may not be visible — the response appears all at once. That’s expected. In production, longer AI responses make the effect clear: each token renders as it arrives rather than waiting for the full response. This is the pattern your chat UI will use.
If tests fail, check:
  • Each token is written with no trailing newline
  • flush() or equivalent is called after each write
  • The summary line format is [N tokens, finish_reason=X]

Step 6: Server-Side Proxy

In this step you’ll implement the proxy server’s stream handler. This is the route that receives requests from browser clients, forwards them upstream to Gloo AI with a server-side auth token, and pipes the SSE response back.

Key Concepts

Why a Proxy?

Browser JavaScript cannot safely include API credentials because anything in client code is visible to anyone who opens DevTools. A proxy server is the standard solution: the browser POSTs to your server, your server adds the auth token and POSTs to Gloo AI, and the SSE stream flows back through your server to the browser. An additional benefit: the proxy can add rate limiting, logging, and multi-tenant auth logic without touching client code.

SSE Headers That Matter

Three headers tell the browser (and any reverse proxies like nginx) that this is a live stream, not a buffered response:
| Header | Value | Why |
| --- | --- | --- |
| Content-Type | text/event-stream | Identifies the SSE protocol |
| Cache-Control | no-cache | Prevents browser caching of the stream |
| X-Accel-Buffering | no | Disables nginx buffering so bytes arrive immediately |

Language-Specific Flushing

Each language needs an explicit flush mechanism to push bytes to the client immediately:
| Language | Flush mechanism |
| --- | --- |
| Python (Flask) | yield from a generator — Flask flushes on each yield |
| JavaScript/TypeScript | res.write() — Express sends immediately |
| PHP | flush() after each write |
| Go | flusher.Flush() — requires http.Flusher interface |
| Java | out.flush() after each write |

Implementation Guide

Open the proxy server file referenced in the code block. The server setup and routing are already in place. Find the stream handler method (or route handler, depending on the language), review the TODO comments, and implement the relay logic:
# File: proxy/server.py
@app.route("/api/stream", methods=["POST", "OPTIONS"])
def stream_proxy():
    if request.method == "OPTIONS":
        return Response(status=204)

    request_data = request.get_json() or {}

    def generate():
        try:
            auth_token = ensure_valid_token()
            headers = {
                "Authorization": f"Bearer {auth_token}",
                "Content-Type": "application/json",
            }
            payload = {**request_data, "stream": True}

            with requests.post(
                API_URL, headers=headers, json=payload, stream=True
            ) as resp:
                if resp.status_code != 200:
                    yield f'data: {{"error": "API error {resp.status_code}"}}\n\n'
                    return

                for line in resp.iter_lines():
                    if line:
                        decoded = line.decode("utf-8")
                        yield f"{decoded}\n\n"

        except Exception as e:
            yield f'data: {{"error": "{str(e)}"}}\n\n'

    return Response(
        generate(),
        mimetype="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        },
    )
The code does the following:
  • Sets Content-Type: text/event-stream, Cache-Control: no-cache, and X-Accel-Buffering: no before writing any response data
  • Handles OPTIONS preflight requests immediately so browsers can POST cross-origin
  • Retrieves a fresh auth token using the pre-built token manager, keeping credentials server-side
  • Reads the incoming request body, injects stream: true, and forwards the request to the Gloo AI API
  • Relays each non-blank SSE line to the client and flushes immediately so tokens reach the browser as they arrive
  • Writes a structured error SSE frame if the upstream request fails, avoiding a silent stream close
PHP, Go, and Java use a generic HTTP handler that receives all request methods, so they include an explicit 405 check before the streaming logic. Python, JavaScript, and TypeScript register the route for POST only, so the framework rejects other methods automatically.

✓ Checkpoint: Proxy Server

Run the proxy server validation test:
python tests/step6_proxy_test.py
Your output should look similar to the following:
🧪 Testing: Server-Side Proxy

Test 1: Starting proxy server on port 3001...
 * Serving Flask app 'proxy.server'
 * Debug mode: off
✓ Proxy server running at http://localhost:3001

Test 2: /health endpoint...
✓ /health returns: {'service': 'completions-streaming-proxy', 'status': 'ok'}

Test 3: POST /api/stream — Content-Type header...
✓ Content-Type: text/event-stream; charset=utf-8

Test 4: SSE line format (data: prefix)...
✓ All lines have 'data: ' prefix (3 data chunks received)
✓ Stream terminated cleanly (finish_reason=stop)

Test 5: CORS headers on response...
✓ Access-Control-Allow-Origin: http://localhost:3000

✅ Proxy server relaying SSE end-to-end.
   Proxy complete: credentials stay server-side, client receives SSE.
If tests fail, check:
  • CORS headers are set before sending the response headers (Java)
  • X-Accel-Buffering: no is present (required to disable nginx buffering)
  • Go: the flusher interface assertion must succeed — this panics if the ResponseWriter doesn’t support flushing
  • PHP: clear any existing output buffers before setting SSE headers

Step 7: Testing Your Complete Implementation

With all six steps implemented, you can now run the full demo, test the proxy server via API, and explore the browser demo.

Run the Demo Script

The entry point runs both examples back-to-back: first it accumulates a full response and prints it, then it streams a second response to the terminal with a typing effect.
python main.py
Your output should look similar to:
Streaming AI Responses in Real Time

Environment variables loaded

Example: Streaming a completion (accumulate full text)...

Full response:
The resurrection of Jesus Christ is a cornerstone of Christian 
faith, holding profound significance for believers. It's not 
merely a historical event but a theological truth that reshapes 
our understanding of God, humanity, and...

Received 16 tokens in 6864ms
  Finish reason: stop

Example: Typing-effect rendering...
Prompt: Tell me about Christian discipleship.

Response: Christian discipleship is a transformative journey of 
following Jesus Christ, learning from His teachings, and striving 
to live a life that reflects His character and mission...

[11 tokens, finish_reason=stop]

Test the Proxy Server via API

Start the proxy server in one terminal:
python proxy/server.py
Then send a request from another terminal using curl:
curl -X POST http://localhost:3001/api/stream \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "auto_routing": true}'
You will see the SSE stream arrive line by line:
data: {"id": "gen-abc123", "choices": [{"delta": {"content": "Hello", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null}, "finish_reason": null, "index": 0, "logprobs": null, "native_finish_reason": null}], "created": 1774527271, "model": "google/gemini-2.5-flash", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null, "provider": "Gloo AI", "ttft_ms": 940.61}

data: {"id": "gen-abc123", "choices": [{"delta": {"content": "! How", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null}, "finish_reason": null, "index": 0, "logprobs": null, "native_finish_reason": null}], "created": 1774527271, "model": "google/gemini-2.5-flash", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null, "provider": "Gloo AI"}

data: {"id": "gen-abc123", "choices": [{"delta": {"content": " can I help you today?", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null}, "finish_reason": null, "index": 0, "logprobs": null, "native_finish_reason": null}], "created": 1774527271, "model": "google/gemini-2.5-flash", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null, "provider": "Gloo AI"}

data: {"id": "gen-abc123", "choices": [{"delta": {"content": "", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null}, "finish_reason": "stop", "index": 0, "logprobs": null, "native_finish_reason": "STOP"}], "created": 1774527271, "model": "google/gemini-2.5-flash", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null, "provider": "Gloo AI"}
Each line is a JSON-encoded delta. The final chunk signals the end of the stream with a non-null finish_reason.
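Each data: line can be parsed independently of the others. As a minimal sketch (not the tutorial's exact helper), a function that pulls the content delta and finish_reason out of one line might look like:

```javascript
// Minimal SSE chunk parser: extracts the content delta and finish_reason
// from a single "data: <json>" line, matching the chunk shape shown above.
function parseSseLine(line) {
  if (!line.startsWith("data: ")) return null; // skip blank separator lines
  const chunk = JSON.parse(line.slice("data: ".length));
  const choice = chunk.choices[0];
  return {
    content: choice.delta.content ?? "",
    finishReason: choice.finish_reason, // non-null only on the final chunk
  };
}

// Example with a final-chunk line:
const parsed = parseSseLine(
  'data: {"choices":[{"delta":{"content":""},"finish_reason":"stop"}]}'
);
console.log(parsed); // { content: '', finishReason: 'stop' }
```

Blank separator lines return null and can simply be skipped by the caller.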

Browser Demo

The browser demo is a standalone HTML file separate from the language starter projects — no install step required.

frontend-example/

Download or clone this directory alongside your language starter project.
The file connects to the proxy over HTTP, so it works with any language’s proxy server. With the proxy already running on port 3001, serve the browser client from the frontend-example/ directory using whichever tool you have available:
# Node
npx serve

# Python
python -m http.server 3000

# PHP
php -S localhost:3000
Do not open index.html directly via File > Open. When loaded as a file:// URL, the browser reports Origin: null, which the proxy’s CORS policy rejects. You must serve the file over HTTP so the origin is http://localhost:3000.
Then open http://localhost:3000 in your browser, type a question, and click Send — tokens appear one by one as they arrive from the proxy.

[Image: Gloo AI Streaming Demo browser page showing a streamed response to "What is my purpose in life"]

How the Browser Connects to the Stream

Browsers have a built-in API called EventSource designed for receiving server-sent events — but it only supports GET requests. Since the completions API requires a POST body containing the message text, EventSource can’t be used here. Instead, the demo page uses fetch() with a ReadableStream, which supports any HTTP method:
// File: frontend-example/index.html
const response = await fetch("http://localhost:3001/api/stream", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages: [{ role: "user", content: message }] }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
The ReadableStream API works identically to what you used in the terminal renderer — the same line buffer, SSE parser, and token extractor pattern applies.
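As a sketch of that shared pattern (assuming the response object from the fetch() call above, and the chunk shape shown earlier), the read loop accumulates decoded bytes, splits on newlines, and keeps the trailing fragment for the next chunk:

```javascript
// Sketch of the browser read loop: buffer raw bytes, split on newlines,
// and carry the incomplete last line over to the next chunk.
// `response` is the fetch() result; `onToken` receives each content delta.
async function readStream(response, onToken) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    const lines = buffer.split("\n");
    buffer = lines.pop(); // save the incomplete last fragment

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue; // skip blank separators
      const chunk = JSON.parse(line.slice(6));
      const { delta, finish_reason } = chunk.choices[0];
      if (delta.content) onToken(delta.content);
      if (finish_reason) return finish_reason; // end of stream
    }
  }
}
```

The `buffer = lines.pop()` step is what prevents garbled tokens when a chunk boundary lands mid-line.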

Markdown Rendering

AI responses often contain markdown. Inserting raw tokens directly into the DOM produces broken mid-stream output — **bo appears before ld** closes the bold span. The correct pattern is to accumulate tokens and re-parse the full buffer on each token:
// File: frontend-example/index.html
let buffer = "";

// On each token:
buffer += content;
outputEl.innerHTML = DOMPurify.sanitize(marked.parse(buffer));
marked.parse() runs on every token — slightly redundant but always produces valid HTML. DOMPurify.sanitize() prevents XSS from any HTML in the AI response.
For production, serve the browser client from the same origin as the proxy, or set PROXY_CORS_ORIGIN in your .env to match your frontend domain.
For React applications, the Vercel AI SDK useChat hook handles streaming, markdown rendering, and state management out of the box — it’s a higher-level alternative to building this pattern manually.

Troubleshooting

Stream hangs and never produces output: Verify "stream": true is in the request payload. Without it, the API returns a single buffered JSON response, so the connection may appear to hang while your parser waits for SSE lines that never arrive.

Garbled or split tokens: The line buffer is missing or incorrect. In JS/TS/PHP, raw bytes must be accumulated and split on \n before parsing. Make sure buffer = lines.pop() saves the incomplete last fragment.

Authentication failed (401): Your .env file is missing GLOO_CLIENT_ID or GLOO_CLIENT_SECRET, or the values are incorrect. Run the Step 1 checkpoint to verify credentials load correctly.

Browser blocks direct API calls (CORS error): Browsers enforce the same-origin policy, so direct calls from browser JavaScript to platform.ai.gloo.com will be blocked. Use the proxy server (Step 6) so API calls happen server-side.

Failed to fetch when serving the browser demo on a port other than 3000: The proxy allows requests only from http://localhost:3000 by default. If your file server uses a different port (e.g. VS Code / Cursor Live Server on port 5500, or python -m http.server 8080), the browser's Origin header won't match and the proxy blocks the request. Fix: set PROXY_CORS_ORIGIN in your .env to the exact origin shown in your browser's address bar, then restart the proxy.
# .env — must be an exact match including hostname
PROXY_CORS_ORIGIN=http://127.0.0.1:5500  # Cursor / VS Code Live Server
Note that http://localhost:5500 and http://127.0.0.1:5500 are treated as different origins by the browser even though they resolve to the same address. Copy the origin directly from the address bar to avoid a mismatch.

PHP output appears all at once: PHP's output buffering is active. Call ob_end_flush() (or while (ob_get_level() > 0) ob_end_flush()) before the SSE loop to disable buffering.

Go panics on w.(http.Flusher): Your http.ResponseWriter doesn't implement http.Flusher. This shouldn't happen with the standard net/http server, but it will with some test wrappers. Make sure you're using the http.ResponseWriter passed to your handler directly.

Mid-stream disconnect loses all output: Wrap the read loop in try/catch (or check errors in Go). If fullText already has content when the error occurs, return it rather than re-raising — partial responses are usually more useful than nothing.

Broken markdown mid-stream: Do not insert raw tokens into innerHTML. Accumulate the full buffer and call marked.parse(buffer) on every token so the rendered HTML is valid at each step.
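The partial-recovery pattern from the mid-stream disconnect item above can be sketched in JavaScript. Here readChunk is a hypothetical async token source (not part of the starter code) that returns the next token or null at end of stream, and may throw mid-stream:

```javascript
// Sketch: keep partial output when the stream fails partway through.
// `readChunk` is a hypothetical token source used only for illustration.
async function accumulateWithRecovery(readChunk) {
  let fullText = "";
  try {
    let token;
    while ((token = await readChunk()) !== null) {
      fullText += token;
    }
  } catch (err) {
    if (fullText.length === 0) throw err; // nothing salvageable: re-raise
    console.warn(`Stream interrupted after ${fullText.length} chars: ${err.message}`);
  }
  return fullText; // a partial response is usually better than none
}
```

The same shape applies in Go: check the read error, and return the accumulated text alongside it if anything was received.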

View the Completed Project

If you want to see a working reference before or after completing the steps, the final project is available in the tutorial repository:

Completed Project

Browse the complete implementation in all six languages — Python, JavaScript, TypeScript, PHP, Go, and Java.

Next Steps

  • Grounded Completions — add retrieved context from your content library to improve response accuracy
  • Tool Use — combine streaming with function calling for real-time tool-augmented responses
  • Completions API reference — explore all available parameters including tradition, model_family, and model