Overview

The Gloo AI completions API supports streaming responses, so instead of waiting for the full answer, your application receives tokens one at a time as the model generates them. This creates a faster, more interactive user experience and is the standard pattern for chat and content-generation products. In this tutorial you’ll build a streaming client from scratch: parsing the SSE wire protocol, accumulating tokens, handling errors, and rendering output as it arrives. You’ll also build a server-side proxy that shields your API credentials from the browser.

What You’ll Build

By the end of this tutorial, you’ll have a complete streaming implementation featuring:
  • SSE stream parser that reads tokens as they arrive from the API
  • Token accumulator that assembles the full response with timing and token count
  • Streaming-aware error handler that catches auth and rate-limit errors before reading the stream
  • Terminal renderer that displays tokens in real time with a typing effect
  • Server-side proxy that relays the stream to browser clients without exposing your credentials

Understanding Server-Sent Events

When you set "stream": true in a completions request, the API switches from a single JSON response to an SSE stream. Each token arrives as a line formatted data: <json>, with blank lines separating events:
data: {"choices":[{"delta":{"content":"The"},"finish_reason":null}]}

data: {"choices":[{"delta":{"content":" resurrection"},"finish_reason":null}]}

data: {"choices":[{"delta":{"content":" is"},"finish_reason":null}]}

data: {"choices":[{"delta":{"content":"..."},"finish_reason":"stop"}]}
The stream ends when a chunk arrives with a non-null finish_reason (typically "stop").
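As a minimal sketch of what the parser you'll build in Step 3 does, a single data line can be decoded like this (the line content here is an example payload, not captured output):

```python
import json

# One SSE event line as it arrives on the wire (example payload)
line = 'data: {"choices":[{"delta":{"content":"The"},"finish_reason":null}]}'

# Strip the "data: " prefix, then parse the JSON payload
chunk = json.loads(line[len("data: "):])
token = chunk["choices"][0]["delta"]["content"]
print(token)  # -> The
```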

Two Approaches: Direct vs. Proxy

This tutorial covers two ways to consume the stream:
| Approach | How it works | When to use |
| --- | --- | --- |
| Terminal | Your server calls the API directly and prints tokens | Background jobs, CLIs, server-side rendering |
| Proxy | A lightweight server relays SSE to any external client | Web apps, any case where browser JS would expose credentials |

Prerequisites

Before starting, ensure you have your Gloo AI credentials (GLOO_CLIENT_ID and GLOO_CLIENT_SECRET) ready.
The starter project includes a pre-built auth module. You don’t need to implement authentication in this tutorial — it’s already working in the starter code.

Getting Started with the Starter Project

This tutorial uses a hands-on approach where you’ll build the streaming client incrementally. The starter code provides complete scaffolding with TODO markers guiding each step.

Download the Starter Code

Choose your preferred language and download the starter project:

  • Python — Python 3.9+ · requests · Flask
  • JavaScript — Node.js 18+ · native fetch · Express
  • TypeScript — TypeScript 5+ · typed SSE chunks
  • PHP — PHP 8.1+ · cURL write callback
  • Go — Go 1.20+ · bufio.Scanner · http.Flusher
  • Java — Java 17+ · HttpClient · Maven

Quick Setup

cd starter/python
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your GLOO_CLIENT_ID and GLOO_CLIENT_SECRET

Test Your Setup

Run the entry point — it should load your credentials and confirm the stubs are in place:
python main.py
You should see credentials load successfully, followed by NotImplementedError (or equivalent) from the first stub — confirming that setup is complete and you’re ready to implement.

Architecture Overview

Component Architecture

Implementation Roadmap

| Step | What You Build | Track | Validates |
| --- | --- | --- | --- |
| 1 | Environment setup | Shared | Auth loads; streaming endpoint reachable |
| 2 | Handle stream errors | Shared | 401/403/429 errors thrown before stream read |
| 3 | Streaming request + SSE parsing | Shared | HTTP connection opens; SSE lines parsed; [DONE] detected |
| 4 | Token extraction + accumulation | Shared | Token text extracted; full response assembled with timing |
| 5 | Render stream to terminal | Terminal | Tokens print live to terminal |
| 6 | Proxy stream handler | Proxy | SSE relayed through server |
| 7 † | Testing & browser demo | Proxy | End-to-end validation |
† No new implementation — run the demo, test the proxy via API, and explore the browser client.
Steps 1–5 build the streaming client. Step 6 adds the server-side proxy. Step 7 walks through the browser demo. Let’s get started!

Step 1: Environment Setup & Auth Verification

The starter project includes a pre-built auth module that handles OAuth2 client credentials. Before implementing any streaming logic, confirm it works with the streaming endpoint.

What You’ll Verify

  1. Credentials load correctly from .env
  2. A token can be obtained from the Gloo AI auth server
  3. A request to the completions endpoint returns 200 OK with Content-Type: text/event-stream

Testing Your Setup

Run the Step 1 checkpoint now — it should pass with the pre-built auth:
python tests/step1_auth_test.py

✓ Checkpoint: Auth Verification

Your output should look similar to the following:
🧪 Testing: Environment Setup & Auth Verification

✓ GLOO_CLIENT_ID loaded
✓ GLOO_CLIENT_SECRET loaded

Test 1: Obtaining access token...
✓ Access token obtained
  Expires in: 3600 seconds

Test 2: Token caching (ensure_valid_token)...
✓ Token cached correctly — same token returned on consecutive calls

Test 3: Verifying streaming endpoint...
✓ Status: 200 OK
✓ Content-Type: text/event-stream; charset=utf-8

✅ Auth and streaming endpoint verified.
   Next: Making the Streaming Request
If tests fail, check:
  • .env file exists in the language directory (not just .env.example)
  • GLOO_CLIENT_ID and GLOO_CLIENT_SECRET are set correctly
  • You’ve completed the Authentication Tutorial prerequisites

Step 2: Streaming-Aware Error Handling

Now implement the stream error handler, a focused function that maps HTTP status codes to descriptive exceptions before any stream data is read.

Key Concepts

Two-Phase Error Handling

Streaming introduces two distinct error phases:

Phase 1 — Pre-stream (before reading bytes): The HTTP status tells you everything. A 401 means bad token; a 429 means slow down. Check the status immediately and throw a specific error before touching the body. This is what the stream error handler does.

Phase 2 — Mid-stream (while reading bytes): The connection is live when something fails — network drop, server restart, timeout. Catch these in the accumulation loop with a try/catch around the read loop. If you’ve already accumulated partial text, preserve it and return what you have rather than discarding the work.

Separating these phases makes errors debuggable: pre-stream errors have status codes; mid-stream errors have partial content.

Implementation Guide

Open your streaming client file and find the error handler method; it’s a small, focused function with one case per status code. Review the TODO comments, then implement the function:
# File: streaming/stream_client.py
def handle_stream_error(status_code: int, response_body: str = "") -> None:
    if status_code == 401:
        raise Exception("Authentication failed (401): Invalid or expired token")
    elif status_code == 403:
        raise Exception("Authorization failed (403): Insufficient permissions")
    elif status_code == 429:
        raise Exception("Rate limit exceeded (429): Too many requests")
    elif status_code != 200:
        raise Exception(f"API error ({status_code}): {response_body[:200]}")
The code does the following:
  • Throws an authentication error on 401 if the token is missing, expired, or malformed
  • Throws an authorization error on 403 if the token is valid but lacks permission for this resource
  • Throws a rate limit error on 429 if the request was rejected before the API spent any compute
  • Throws a generic error for any other non-200 status, including the response body for diagnostic context
  • Returns without throwing on 200 so the caller can proceed to read the stream
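The pre-stream contract can be exercised in isolation. The function body below is copied from the step above so the snippet runs standalone; in the starter project you would call the method on your client instead:

```python
def handle_stream_error(status_code: int, response_body: str = "") -> None:
    # Same body as the implementation above (repeated here for a runnable snippet)
    if status_code == 401:
        raise Exception("Authentication failed (401): Invalid or expired token")
    elif status_code == 403:
        raise Exception("Authorization failed (403): Insufficient permissions")
    elif status_code == 429:
        raise Exception("Rate limit exceeded (429): Too many requests")
    elif status_code != 200:
        raise Exception(f"API error ({status_code}): {response_body[:200]}")

# 200 passes silently; anything else raises before the stream is ever read
handle_stream_error(200)
try:
    handle_stream_error(429)
except Exception as e:
    print(e)  # -> Rate limit exceeded (429): Too many requests
```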

✓ Checkpoint: Error Handling

Run the error handling test:
python tests/step2_error_handling_test.py
Your output should look similar to the following:
🧪 Testing: Streaming Error Handling

Test 1: handle_stream_error(401)...
✓ 401 raises: Authentication failed (401): Invalid or expired token
Test 2: handle_stream_error(403)...
✓ 403 raises: Authorization failed (403): Insufficient permissions
Test 3: handle_stream_error(429)...
✓ 429 raises: Rate limit exceeded (429): Too many requests
Test 4: handle_stream_error(200) — success, no exception...
✓ 200 OK — no exception raised
Test 5: handle_stream_error(500)...
✓ 500 throws with body: API error (500): Internal Server Error

✅ Two-phase error handling working.
   Next: Streaming Requests & SSE Parsing
If tests fail, check:
  • Status 200 must not raise an exception
  • The error message for non-200 includes the status code
  • The response body is truncated (first 200 chars) to avoid enormous error messages

Step 3: Streaming Requests & SSE Parsing

Time to wire up the streaming connection. You’ll open a persistent HTTP connection to the completions API and write the parser that converts raw SSE lines into something you can actually work with.

What You’ll Implement

  1. A function to initiate a streaming request
  2. A function to parse individual SSE lines

Making the Streaming Request

Why stream: true Changes Everything

Without stream: true, the API buffers the entire response and returns it as a single JSON object. With stream: true, it switches to SSE mode: the connection stays open and bytes arrive incrementally as the model generates them. This is why you return the raw response object rather than parsed JSON — the body isn’t fully available yet. The caller will read it line by line in the next steps.

Fail Fast Before Reading

Checking the HTTP status code before starting to read the stream is important for a clean user experience. A 401 response will never produce SSE data — it returns a JSON error body. If you skipped the status check and tried to parse lines from a 401 response, you’d get confusing parse errors instead of a clear “authentication failed” message.

Implementation Guide

Still in the same streaming client file, find the streaming request method, review the TODO comments, then implement the changes outlined in the code block:
# File: streaming/stream_client.py
def make_streaming_request(message: str, token: str):
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    payload = {
        "messages": [{"role": "user", "content": message}],
        "auto_routing": True,
        "stream": True,
    }
    response = requests.post(API_URL, headers=headers, json=payload, stream=True)
    handle_stream_error(
        response.status_code,
        response.text if response.status_code != 200 else "",
    )
    return response
The code does the following:
  • Sets Authorization and Content-Type headers using the provided token
  • Builds the request payload with stream: true to enable SSE mode and auto_routing: true to let Gloo select the best model
  • Checks the HTTP status before reading any response data, raising a descriptive error for non-200 responses
  • Returns the raw response object so the caller can iterate its body line by line
PHP note: cURL’s streaming architecture doesn’t allow inspecting the HTTP status before the write callback fires. The status check happens on the first data chunk instead. This is the idiomatic PHP pattern for streaming with cURL.

Parsing SSE Lines

The SSE Wire Format

SSE is a simple text protocol. Each event is one line starting with data: , terminated by a blank line. In practice, the Gloo AI stream looks like:
data: {"id":"...","choices":[{"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"...","choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"...","choices":[{"delta":{"content":" world"},"finish_reason":"stop"}]}
Blank lines are separators, not errors — they’re common and should be silently skipped. Lines that don’t start with data: (such as event: or : comment lines) should also be skipped.

Defensive Parsing

The JSON parse is wrapped in a try/catch. Mid-stream network hiccups can produce partial lines — you don’t want a single malformed chunk to crash the entire stream. Return null for unparseable lines and let the accumulation loop move on.

Implementation Guide

You’re still working with the streaming client file. Find the SSE line parser method, review the TODO comments, then implement:
# File: streaming/stream_client.py
def parse_sse_line(line: str):
    if not line or not line.strip():
        return None
    if not line.startswith("data: "):
        return None
    data = line[6:]  # strip 'data: ' prefix
    if data.strip() == "[DONE]":
        return "[DONE]"
    try:
        return json.loads(data)
    except json.JSONDecodeError:
        return None
The code does the following:
  • Returns null for blank lines and lines that don’t start with data: , signalling the caller to skip to the next line
  • Strips the data: prefix to isolate the raw JSON payload
  • Detects the [DONE] sentinel before attempting JSON parsing and returns it as a string to signal the end of the stream
  • Parses the payload as JSON and returns the result, or null if parsing fails — never throws on malformed input

✓ Checkpoint: Streaming Request & SSE Parsing

Run the validation test for this step:
python tests/step3_sse_parsing_test.py
Your output should look similar to the following:
🧪 Testing: Streaming Request & SSE Line Parsing

✓ Token obtained

Test 1: parse_sse_line — blank line...
✓ Blank line → None
Test 2: parse_sse_line — non-data line...
✓ Non-data line → None
Test 3: parse_sse_line — [DONE] sentinel...
✓ data: [DONE] → '[DONE]'
Test 4: parse_sse_line — valid JSON data line...
✓ data: {json} → parsed dict
Test 5: parse_sse_line — malformed JSON...
✓ Malformed JSON → None (gracefully handled)

Test 6: make_streaming_request() — live connection...
✓ Streaming connection opened (status 200)
Test 7: Iterating SSE lines and detecting stream termination...
✓ Processed 5 lines, 2 data chunks
✓ Stream terminated cleanly (finish_reason=stop)

Test 8: Bad credentials → authentication error before reading stream...
✓ Bad credentials caught (pre-stream): Authorization failed (403): Insufficient permissions

✅ Streaming request and SSE parsing working.
   Next: Token Extraction & Accumulation
If tests fail, check:
  • The streaming request function sets stream to true in the payload
  • The SSE line parser strips exactly 6 characters ("data: " has a space after the colon)
  • The [DONE] check happens before the JSON parse

Step 4: Token Extraction & Accumulation

Next you’ll add the pieces to pull the token out of each parsed SSE chunk, and the accumulation loop that stitches everything together into a complete result.

What You’ll Implement

  1. A function to extract token content from a parsed SSE chunk
  2. A function to collect the full stream into a result object

Extracting Token Content

Why Content Can Be Absent

Not every SSE chunk carries text. The first chunk establishes the role (delta: {"role": "assistant"}), while the final chunk carries the finish reason with an empty or absent delta. Only chunks in the middle carry actual content. This is why you return an empty string rather than throwing: an absent content field is completely normal. The accumulation loop skips empty strings when counting tokens.

Null-Safe Navigation

Different languages handle missing keys differently. In Python, .get() returns None without raising; in JavaScript/TypeScript, optional chaining (?.) does the same. In Go and Java the struct is fully typed, so missing content simply maps to the zero value. The goal in all languages is the same: never throw when a field is absent.

Implementation Guide

Still in the streaming client file, find the token content extractor, review the TODO comments, then implement:
# File: streaming/stream_client.py
def extract_token_content(chunk: dict) -> str:
    try:
        choices = chunk.get("choices", [])
        if not choices:
            return ""
        delta = choices[0].get("delta", {})
        return delta.get("content") or ""
    except (IndexError, AttributeError, KeyError):
        return ""
The code does the following:
  • Returns an empty string immediately if choices is absent or empty — the first and last chunks often carry no content
  • Reads delta.content from the first choice, returning an empty string if the field is absent or null
  • Handles any unexpected chunk structure by returning an empty string rather than throwing, keeping the accumulation loop running cleanly
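To see the null-safe behavior concretely, here is the extractor from this step (body copied from above so the snippet runs standalone) exercised against the three chunk shapes described earlier:

```python
def extract_token_content(chunk: dict) -> str:
    # Same body as the implementation above (repeated here for a runnable snippet)
    try:
        choices = chunk.get("choices", [])
        if not choices:
            return ""
        delta = choices[0].get("delta", {})
        return delta.get("content") or ""
    except (IndexError, AttributeError, KeyError):
        return ""

# Role-only first chunk, a content chunk, and a chunk with no choices
print(repr(extract_token_content({"choices": [{"delta": {"role": "assistant"}}]})))  # -> ''
print(repr(extract_token_content({"choices": [{"delta": {"content": "Hi"}}]})))      # -> 'Hi'
print(repr(extract_token_content({"choices": []})))                                  # -> ''
```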

Accumulating the Full Response

Two Ways to Consume a Stream

You can either accumulate all tokens into a string (what the function in this step does) or print each token immediately as it arrives (what the function in Step 5 does). The choice depends on whether you need the full text before taking action:
  • Accumulate: useful when you need to parse the full response, log it, or return it from an API
  • Print immediately: useful for CLI tools and browser UIs where you want the typing effect

The Line Buffer (JS/TS/PHP)

In Python and Go, the HTTP libraries provide line-at-a-time iteration. In JavaScript, TypeScript, and PHP, you read raw bytes and split on \n yourself. This requires a line buffer: keep any incomplete final chunk in a variable and prepend it to the next read’s output. Without it, tokens near chunk boundaries get split across two parse calls.
buffer += decoder.decode(value, { stream: true });
const lines = buffer.split("\n");
buffer = lines.pop() ?? ""; // save incomplete last line
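The same buffering technique, sketched here in Python over hypothetical raw byte chunks (Python's iter_lines normally does this for you, so this is purely illustrative), behaves like:

```python
def split_lines(chunks):
    """Yield complete lines from an iterable of byte chunks,
    buffering any incomplete trailing line between reads."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk.decode("utf-8")
        lines = buffer.split("\n")
        buffer = lines.pop()  # save incomplete last line for the next chunk
        for line in lines:
            yield line
    if buffer:
        yield buffer  # flush whatever remains at end of stream

# A data line split across two network reads still parses as one line
chunks = [b'data: {"a"', b': 1}\n\n']
print(list(split_lines(chunks)))  # -> ['data: {"a": 1}', '']
```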

Implementation Guide

Open the streaming client file and find the accumulation loop method. This one brings together everything from the previous steps, with the TODO comments showing each stage. Take a moment to trace through the structure before implementing.
# File: streaming/stream_client.py
def stream_completion(message: str, token: str) -> dict:
    start_time = time.time()
    response = make_streaming_request(message, token)

    full_text = ""
    token_count = 0
    finish_reason = "unknown"

    try:
        for raw_line in response.iter_lines(decode_unicode=True):
            chunk = parse_sse_line(raw_line)
            if chunk is None:
                continue
            if chunk == "[DONE]":
                break
            content = extract_token_content(chunk)
            if content:
                full_text += content
                token_count += 1
            choices = chunk.get("choices", [])
            if choices and choices[0].get("finish_reason"):
                finish_reason = choices[0]["finish_reason"]
    except Exception:
        if full_text:
            pass  # preserve partial output on error
        else:
            raise

    duration_ms = int((time.time() - start_time) * 1000)
    return {
        "text": full_text,
        "token_count": token_count,
        "duration_ms": duration_ms,
        "finish_reason": finish_reason,
    }
The code does the following:
  • Records the start time before opening the stream so elapsed duration includes connection overhead
  • Initializes accumulators for the full text, token count, and finish reason
  • Iterates the stream line by line, parsing each with the SSE parser and skipping null lines
  • Records the finish reason whenever a chunk carries a non-null finish_reason, and stops the loop when the [DONE] sentinel arrives
  • Returns a single result object containing the assembled text, token count, elapsed duration in milliseconds, and finish reason

✓ Checkpoint: Token Extraction & Accumulation

Run the validation test for this step:
python tests/step4_accumulation_test.py
Your output should look similar to the following:
🧪 Testing: Token Extraction & Accumulation

Test 1: extract_token_content — normal chunk...
✓ Normal chunk → 'Hello'
Test 2: extract_token_content — null content delta...
✓ Null content → ''
Test 3: extract_token_content — empty delta (role-only chunk)...
✓ Empty delta → ''
Test 4: extract_token_content — no choices...
✓ Empty choices → ''
Test 5: extract_token_content — finish_reason chunk...
✓ finish_reason chunk → '' (no content tokens from finish chunk)

Test 6: stream_completion — full response assembly...
✓ Delta content extraction working
✓ Null delta handled gracefully
✓ finish_reason detected: stop
✓ Duration tracked: 2098ms
✓ Token count: 2 tokens
  Response preview: '1 2 3 4 5'

✅ Full response assembled.
   Next: Typing-Effect Renderer
If tests fail, check:
  • The token content extractor returns "" (not None/null) when content is absent
  • The accumulation loop reads finish_reason from choices[0], not from the top-level chunk
  • The line buffer (buffer = lines.pop()) is in place for JS/TS/PHP

Step 5: Typing-Effect Terminal Renderer

Now implement the terminal renderer, a function that prints each token immediately to stdout without a newline, creating a live typing effect in the terminal. This step demonstrates an important pattern: consuming the stream directly rather than accumulating it first. The renderer calls the streaming request, SSE parsing, and token extraction functions, but skips the accumulation loop entirely.

Key Concepts

Unbuffered Output

By default, most languages buffer stdout, which means output is held until the buffer fills or the program exits. For a typing effect you need every token to appear immediately. Each language has its own way to force this:
| Language | Unbuffered write |
| --- | --- |
| Python | print(content, end="", flush=True) |
| JavaScript / TypeScript | process.stdout.write(content) |
| PHP | echo $content; ob_flush(); flush(); |
| Go | fmt.Fprint(os.Stdout, content) (stdout is unbuffered by default) |
| Java | System.out.print(content); System.out.flush(); |
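A tiny, self-contained demo of the Python variant (simulated tokens standing in for SSE deltas — no API call) makes the effect visible in any terminal:

```python
import time

# Simulated tokens standing in for SSE deltas (illustrative, not real model output)
tokens = ["Streaming ", "makes ", "output ", "feel ", "alive."]

for token in tokens:
    print(token, end="", flush=True)  # no newline; flush forces immediate display
    time.sleep(0.2)                   # exaggerate the arrival gap for effect
print()
```

Run without the flush=True and the text appears all at once when the program exits — exactly the buffering problem described above.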

Direct Stream Consumption vs. Accumulation

The stream completion function from Step 4 accumulates everything and returns once the stream is complete. The terminal renderer function prints as it goes: the user sees output before the model has finished generating. Both patterns are valid; the right choice depends on whether the output needs to be complete before it’s useful.

Implementation Guide

Open the renderer file referenced in the code block. Unlike the streaming client, this file has a single method to implement. Review the TODO comments, then implement:
# File: browser/renderer.py
def render_stream_to_terminal(message: str, token: str) -> None:
    print(f"Prompt: {message}\n")
    print("Response: ", end="", flush=True)

    response = make_streaming_request(message, token)

    total_tokens = 0
    finish_reason = "unknown"

    for raw_line in response.iter_lines(decode_unicode=True):
        chunk = parse_sse_line(raw_line)
        if chunk is None:
            continue
        if chunk == "[DONE]":
            break
        content = extract_token_content(chunk)
        if content:
            print(content, end="", flush=True)
            total_tokens += 1
        choices = chunk.get("choices", [])
        if choices and choices[0].get("finish_reason"):
            finish_reason = choices[0]["finish_reason"]

    print()
    print(f"\n[{total_tokens} tokens, finish_reason={finish_reason}]")
The code does the following:
  • Prints the user’s message as a prompt header before the response begins
  • Opens the stream and iterates SSE lines directly, without an accumulation loop, so tokens are available to print as soon as they arrive
  • Writes each token to stdout without a trailing newline and flushes immediately, producing a character-by-character typing effect
  • Prints a summary line with the total token count and finish reason after the stream ends

✓ Checkpoint: Terminal Renderer

Run the validation test:
python tests/step5_renderer_test.py
Your output should look similar to the following:
🧪 Testing: Typing-Effect Renderer

✓ Token obtained

Test 1: render_stream_to_terminal() — streaming to terminal...
Prompt: Reply with exactly: Hello streaming world

Response: Hello streaming world

[2 tokens, finish_reason=stop]
✓ Prompt header printed
✓ Response label printed
✓ Token summary found: 2 tokens, finish_reason=stop

✅ Typing-effect renderer working.
   Next: Server-Side Proxy
With a short prompt like this, tokens arrive so quickly that the typing effect may not be visible — the response appears all at once. That’s expected. In production, longer AI responses make the effect clear: each token renders as it arrives rather than waiting for the full response. This is the pattern your chat UI will use.
If tests fail, check:
  • Each token is written with no trailing newline
  • flush() or equivalent is called after each write
  • The summary line format is [N tokens, finish_reason=X]

Step 6: Server-Side Proxy

In this step you’ll implement the proxy server’s stream handler. This is the route that receives requests from browser clients, forwards them upstream to Gloo AI with a server-side auth token, and pipes the SSE response back.

Key Concepts

Why a Proxy?

Browser JavaScript cannot safely include API credentials because anything in client code is visible to anyone who opens DevTools. A proxy server is the standard solution: the browser POSTs to your server, your server adds the auth token and POSTs to Gloo AI, and the SSE stream flows back through your server to the browser. An additional benefit: the proxy can add rate limiting, logging, and multi-tenant auth logic without touching client code.

SSE Headers That Matter

Three headers tell the browser (and any reverse proxies like nginx) that this is a live stream, not a buffered response:
| Header | Value | Why |
| --- | --- | --- |
| Content-Type | text/event-stream | Identifies the SSE protocol |
| Cache-Control | no-cache | Prevents browser caching of the stream |
| X-Accel-Buffering | no | Disables nginx buffering so bytes arrive immediately |

Language-Specific Flushing

Each language needs an explicit flush mechanism to push bytes to the client immediately:
| Language | Flush mechanism |
| --- | --- |
| Python (Flask) | yield from a generator — Flask flushes on each yield |
| JavaScript/TypeScript | res.write() — Express sends immediately |
| PHP | flush() after each write |
| Go | flusher.Flush() — requires http.Flusher interface |
| Java | out.flush() after each write |

Implementation Guide

Open the proxy server file referenced in the code block. The server setup and routing are already in place. Find the stream handler method (or route handler, depending on the language), review the TODO comments, and implement the relay logic:
# File: proxy/server.py
@app.route("/api/stream", methods=["POST", "OPTIONS"])
def stream_proxy():
    if request.method == "OPTIONS":
        return Response(status=204)

    request_data = request.get_json() or {}

    def generate():
        try:
            auth_token = ensure_valid_token()
            headers = {
                "Authorization": f"Bearer {auth_token}",
                "Content-Type": "application/json",
            }
            payload = {**request_data, "stream": True}

            with requests.post(
                API_URL, headers=headers, json=payload, stream=True
            ) as resp:
                if resp.status_code != 200:
                    yield f'data: {{"error": "API error {resp.status_code}"}}\n\n'
                    return

                for line in resp.iter_lines():
                    if line:
                        decoded = line.decode("utf-8")
                        yield f"{decoded}\n\n"

        except Exception as e:
            yield f'data: {{"error": "{str(e)}"}}\n\n'

    return Response(
        generate(),
        mimetype="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        },
    )
The code does the following:
  • Sets Content-Type: text/event-stream, Cache-Control: no-cache, and X-Accel-Buffering: no before writing any response data
  • Handles OPTIONS preflight requests immediately so browsers can POST cross-origin
  • Retrieves a fresh auth token using the pre-built token manager, keeping credentials server-side
  • Reads the incoming request body, injects stream: true, and forwards the request to the Gloo AI API
  • Relays each non-blank SSE line to the client and flushes immediately so tokens reach the browser as they arrive
  • Writes a structured error SSE frame if the upstream request fails, avoiding a silent stream close
PHP, Go, and Java use a generic HTTP handler that receives all request methods, so they include an explicit 405 check before the streaming logic. Python, JavaScript, and TypeScript register the route for POST only, so the framework rejects other methods automatically.

✓ Checkpoint: Proxy Server

Run the proxy server validation test:
python tests/step6_proxy_test.py
Your output should look similar to the following:
🧪 Testing: Server-Side Proxy

Test 1: Starting proxy server on port 3001...
 * Serving Flask app 'proxy.server'
 * Debug mode: off
✓ Proxy server running at http://localhost:3001

Test 2: /health endpoint...
✓ /health returns: {'service': 'completions-streaming-proxy', 'status': 'ok'}

Test 3: POST /api/stream — Content-Type header...
✓ Content-Type: text/event-stream; charset=utf-8

Test 4: SSE line format (data: prefix)...
✓ All lines have 'data: ' prefix (3 data chunks received)
✓ Stream terminated cleanly (finish_reason=stop)

Test 5: CORS headers on response...
✓ Access-Control-Allow-Origin: http://localhost:3000

✅ Proxy server relaying SSE end-to-end.
   Proxy complete: credentials stay server-side, client receives SSE.
If tests fail, check:
  • CORS headers are set before sending the response headers (Java)
  • X-Accel-Buffering: no is present (required to disable nginx buffering)
  • Go: the flusher interface assertion must succeed — this panics if the ResponseWriter doesn’t support flushing
  • PHP: clear any existing output buffers before setting SSE headers

Step 7: Testing Your Complete Implementation

With all six steps implemented, you can now run the full demo, test the proxy server via API, and explore the browser demo.

Run the Demo Script

The entry point runs both examples back-to-back: first it accumulates a full response and prints it, then it streams a second response to the terminal with a typing effect.
python main.py
Your output should look similar to:
Streaming AI Responses in Real Time

Environment variables loaded

Example: Streaming a completion (accumulate full text)...

Full response:
The resurrection of Jesus Christ is a cornerstone of Christian 
faith, holding profound significance for believers. It's not 
merely a historical event but a theological truth that reshapes 
our understanding of God, humanity, and...

Received 16 tokens in 6864ms
  Finish reason: stop

Example: Typing-effect rendering...
Prompt: Tell me about Christian discipleship.

Response: Christian discipleship is a transformative journey of 
following Jesus Christ, learning from His teachings, and striving 
to live a life that reflects His character and mission...

[11 tokens, finish_reason=stop]

Test the Proxy Server via API

Start the proxy server in one terminal:
python proxy/server.py
Then send a request from another terminal using curl:
curl -X POST http://localhost:3001/api/stream \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "auto_routing": true}'
You will see the SSE stream arrive line by line:
data: {"id": "gen-abc123", "choices": [{"delta": {"content": "Hello", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null}, "finish_reason": null, "index": 0, "logprobs": null, "native_finish_reason": null}], "created": 1774527271, "model": "google/gemini-2.5-flash", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null, "provider": "Gloo AI", "ttft_ms": 940.61}

data: {"id": "gen-abc123", "choices": [{"delta": {"content": "! How", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null}, "finish_reason": null, "index": 0, "logprobs": null, "native_finish_reason": null}], "created": 1774527271, "model": "google/gemini-2.5-flash", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null, "provider": "Gloo AI"}

data: {"id": "gen-abc123", "choices": [{"delta": {"content": " can I help you today?", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null}, "finish_reason": null, "index": 0, "logprobs": null, "native_finish_reason": null}], "created": 1774527271, "model": "google/gemini-2.5-flash", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null, "provider": "Gloo AI"}

data: {"id": "gen-abc123", "choices": [{"delta": {"content": "", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null}, "finish_reason": "stop", "index": 0, "logprobs": null, "native_finish_reason": "STOP"}], "created": 1774527271, "model": "google/gemini-2.5-flash", "object": "chat.completion.chunk", "service_tier": null, "system_fingerprint": null, "usage": null, "provider": "Gloo AI"}
Each line is a JSON-encoded delta. The final chunk signals the end of the stream with a non-null finish_reason.
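Each data: line can be parsed independently of the others. As a minimal sketch (not the tutorial's exact helper), a function that pulls the content delta and finish_reason out of one line might look like:

```javascript
// Minimal SSE chunk parser: extracts the content delta and finish_reason
// from a single "data: <json>" line, matching the chunk shape shown above.
function parseSseLine(line) {
  if (!line.startsWith("data: ")) return null; // skip blank separator lines
  const chunk = JSON.parse(line.slice("data: ".length));
  const choice = chunk.choices[0];
  return {
    content: choice.delta.content ?? "",
    finishReason: choice.finish_reason, // non-null only on the final chunk
  };
}

// Example with a final-chunk line:
const parsed = parseSseLine(
  'data: {"choices":[{"delta":{"content":""},"finish_reason":"stop"}]}'
);
console.log(parsed); // { content: '', finishReason: 'stop' }
```

Blank separator lines return null and can simply be skipped by the caller.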

Browser Demo

The browser demo is a standalone HTML file separate from the language starter projects — no install step required.

frontend-example/

Download or clone this directory alongside your language starter project.
The file connects to the proxy over HTTP, so it works with any language’s proxy server. With the proxy already running on port 3001, serve the browser client from the frontend-example/ directory using whichever tool you have available:
# Node
npx serve

# Python
python -m http.server 3000

# PHP
php -S localhost:3000
Do not open index.html directly via File > Open. When loaded as a file:// URL, the browser reports Origin: null, which the proxy’s CORS policy rejects. You must serve the file over HTTP so the origin is http://localhost:3000.
Then open http://localhost:3000 in your browser, type a question, and click Send — tokens appear one by one as they arrive from the proxy.

[Image: Gloo AI Streaming Demo browser page showing a streamed response to "What is my purpose in life"]

How the Browser Connects to the Stream

Browsers have a built-in API called EventSource designed for receiving server-sent events — but it only supports GET requests. Since the completions API requires a POST body containing the message text, EventSource can’t be used here. Instead, the demo page uses fetch() with a ReadableStream, which supports any HTTP method:
// File: frontend-example/index.html
const response = await fetch("http://localhost:3001/api/stream", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages: [{ role: "user", content: message }] }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
The ReadableStream API works identically to what you used in the terminal renderer — the same line buffer, SSE parser, and token extractor pattern applies.
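As a sketch of that shared pattern (assuming the response object from the fetch() call above, and the chunk shape shown earlier), the read loop accumulates decoded bytes, splits on newlines, and keeps the trailing fragment for the next chunk:

```javascript
// Sketch of the browser read loop: buffer raw bytes, split on newlines,
// and carry the incomplete last line over to the next chunk.
// `response` is the fetch() result; `onToken` receives each content delta.
async function readStream(response, onToken) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    const lines = buffer.split("\n");
    buffer = lines.pop(); // save the incomplete last fragment

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue; // skip blank separators
      const chunk = JSON.parse(line.slice(6));
      const { delta, finish_reason } = chunk.choices[0];
      if (delta.content) onToken(delta.content);
      if (finish_reason) return finish_reason; // end of stream
    }
  }
}
```

The `buffer = lines.pop()` step is what prevents garbled tokens when a chunk boundary lands mid-line.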

Markdown Rendering

AI responses often contain markdown. Inserting raw tokens directly into the DOM produces broken mid-stream output — **bo appears before ld** closes the bold span. The correct pattern is to accumulate tokens and re-parse the full buffer on each token:
// File: frontend-example/index.html
let buffer = "";

// On each token:
buffer += content;
outputEl.innerHTML = DOMPurify.sanitize(marked.parse(buffer));
marked.parse() runs on every token — slightly redundant but always produces valid HTML. DOMPurify.sanitize() prevents XSS from any HTML in the AI response.
For production, serve the browser client from the same origin as the proxy, or set PROXY_CORS_ORIGIN in your .env to match your frontend domain.
For React applications, the Vercel AI SDK useChat hook handles streaming, markdown rendering, and state management out of the box — it’s a higher-level alternative to building this pattern manually.

Troubleshooting

Stream hangs and never produces output: Verify "stream": true is in the request payload. Without it, the API returns a single buffered JSON response, so the connection may appear to hang while your parser waits for SSE lines that never arrive.

Garbled or split tokens: The line buffer is missing or incorrect. In JS/TS/PHP, raw bytes must be accumulated and split on \n before parsing. Make sure buffer = lines.pop() saves the incomplete last fragment.

Authentication failed (401): Your .env file is missing GLOO_CLIENT_ID or GLOO_CLIENT_SECRET, or the values are incorrect. Run the Step 1 checkpoint to verify credentials load correctly.

Browser blocks direct API calls (CORS error): Browsers enforce the same-origin policy, so direct calls from browser JavaScript to platform.ai.gloo.com will be blocked. Use the proxy server (Step 6) so API calls happen server-side.

Failed to fetch when serving the browser demo on a port other than 3000: The proxy allows requests only from http://localhost:3000 by default. If your file server uses a different port (e.g. VS Code / Cursor Live Server on port 5500, or python -m http.server 8080), the browser's Origin header won't match and the proxy blocks the request. Fix: set PROXY_CORS_ORIGIN in your .env to the exact origin shown in your browser's address bar, then restart the proxy.
# .env — must be an exact match including hostname
PROXY_CORS_ORIGIN=http://127.0.0.1:5500  # Cursor / VS Code Live Server
Note that http://localhost:5500 and http://127.0.0.1:5500 are treated as different origins by the browser even though they resolve to the same address. Copy the origin directly from the address bar to avoid a mismatch.

PHP output appears all at once: PHP's output buffering is active. Call ob_end_flush() (or while (ob_get_level() > 0) ob_end_flush()) before the SSE loop to disable buffering.

Go panics on w.(http.Flusher): Your http.ResponseWriter doesn't implement http.Flusher. This shouldn't happen with the standard net/http server, but it will with some test wrappers. Make sure you're using the http.ResponseWriter passed to your handler directly.

Mid-stream disconnect loses all output: Wrap the read loop in try/catch (or check errors in Go). If fullText already has content when the error occurs, return it rather than re-raising — partial responses are usually more useful than nothing.

Broken markdown mid-stream: Do not insert raw tokens into innerHTML. Accumulate the full buffer and call marked.parse(buffer) on every token so the rendered HTML is valid at each step.
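The partial-recovery pattern from the mid-stream disconnect item above can be sketched in JavaScript. Here readChunk is a hypothetical async token source (not part of the starter code) that returns the next token or null at end of stream, and may throw mid-stream:

```javascript
// Sketch: keep partial output when the stream fails partway through.
// `readChunk` is a hypothetical token source used only for illustration.
async function accumulateWithRecovery(readChunk) {
  let fullText = "";
  try {
    let token;
    while ((token = await readChunk()) !== null) {
      fullText += token;
    }
  } catch (err) {
    if (fullText.length === 0) throw err; // nothing salvageable: re-raise
    console.warn(`Stream interrupted after ${fullText.length} chars: ${err.message}`);
  }
  return fullText; // a partial response is usually better than none
}
```

The same shape applies in Go: check the read error, and return the accumulated text alongside it if anything was received.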

View the Completed Project

If you want to see a working reference before or after completing the steps, the final project is available in the tutorial repository:

Completed Project

Browse the complete implementation in all six languages — Python, JavaScript, TypeScript, PHP, Go, and Java.

Next Steps

  • Grounded Completions — add retrieved context from your content library to improve response accuracy
  • Tool Use — combine streaming with function calling for real-time tool-augmented responses
  • Completions API reference — explore all available parameters including tradition, model_family, and model