Building Streaming LLM Responses with Open-Source Tools

This guide provides a technical implementation path for streaming LLM responses using Server-Sent Events (SSE). It focuses on minimizing perceived latency, bypassing edge-network buffering, and managing UI state during high-frequency token updates.

45 minutes · 6 steps
1. Configure Server-Side Stream Generation

Initialize the LLM provider client with streaming enabled and return a ReadableStream to the client, so the response begins before the full completion is available. The Vercel AI SDK simplifies this by wrapping the provider's stream in a standard response format.

app/api/chat/route.ts
import { OpenAIStream, StreamingTextResponse } from 'ai';
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function POST(req: Request) {
  const { messages } = await req.json();

  // stream: true makes the SDK return token deltas as they are generated
  // instead of a single completed response.
  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    stream: true,
    messages,
  });

  // OpenAIStream adapts the provider stream; StreamingTextResponse wraps it
  // in a Response with streaming-friendly headers.
  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}
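
If you are not using the AI SDK, the same behavior can be hand-rolled by piping the provider's async iterator into a ReadableStream. A minimal sketch of an equivalent handler:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function POST(req: Request) {
  const { messages } = await req.json();
  const completion = await openai.chat.completions.create({
    model: 'gpt-4',
    stream: true,
    messages,
  });

  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      // The OpenAI SDK exposes the streamed completion as an async iterator.
      for await (const chunk of completion) {
        const token = chunk.choices[0]?.delta?.content ?? '';
        controller.enqueue(encoder.encode(token));
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}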

⚠ Common Pitfalls

  • Setting 'stream: false' will cause the entire response to buffer on the server, negating the UX benefits.
  • Ensure your server environment does not have a low execution timeout (e.g., 10s) for long-running stream requests.

2. Bypass Proxy and Middleware Buffering

Intermediate proxies such as NGINX or Cloudflare, and even some framework middleware, may buffer the response to perform compression or logging. You must explicitly disable buffering via response headers so that tokens reach the client as soon as they are generated.

middleware.ts
import { NextResponse } from 'next/server';

export const config = {
  matcher: '/api/chat',
};

export function middleware() {
  const response = NextResponse.next();
  // X-Accel-Buffering: no is specific to NGINX and disables proxy buffering.
  response.headers.set('X-Accel-Buffering', 'no');
  // Content-Encoding: none prevents GZIP buffering issues.
  response.headers.set('Content-Encoding', 'none');
  return response;
}
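
If middleware is not an option, the same headers can also be attached directly to the streaming response, since StreamingTextResponse accepts a standard ResponseInit as its second argument:

// In app/api/chat/route.ts
return new StreamingTextResponse(stream, {
  headers: {
    'X-Accel-Buffering': 'no',
    'Cache-Control': 'no-cache, no-transform',
  },
});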

⚠ Common Pitfalls

  • Cloudflare Rocket Loader or certain 'Polish' settings can interfere with SSE streams.
  • Forgetting 'X-Accel-Buffering: no' will cause the response to arrive in large chunks rather than token-by-token.

3. Implement the Client-Side Stream Consumer

On the frontend, use the Fetch API to read the response body as a ReadableStream. Iterate through the stream using a reader to update the UI state incrementally.

hooks/use-chat-stream.ts
const response = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ messages }),
});
if (!response.body) throw new Error('Response body is not streamable');

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true holds back a trailing partial multi-byte character
  // until the next chunk arrives.
  const chunk = decoder.decode(value, { stream: true });
  setCompletion((prev) => prev + chunk);
}
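
The snippet above assumes the server streams plain text. If your endpoint instead emits raw SSE frames (Content-Type: text/event-stream), each decoded chunk must be split on frame boundaries before appending; a sketch with a hypothetical extractSseData helper:

// SSE frames are separated by a blank line ("\n\n"); a chunk may end
// mid-frame, so keep the trailing partial frame in a buffer.
let buffer = '';

function extractSseData(chunk: string): string[] {
  buffer += chunk;
  const frames = buffer.split('\n\n');
  buffer = frames.pop() ?? ''; // keep the trailing partial frame
  return frames
    .flatMap((frame) => frame.split('\n'))
    .filter((line) => line.startsWith('data:'))
    .map((line) => line.slice(5).trim())
    .filter((data) => data !== '[DONE]');
}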

⚠ Common Pitfalls

  • Runaway loops: always break when 'done' is true; otherwise the loop never exits and the reader keeps its lock on the stream.
  • TextDecoder must be used with '{ stream: true }' to handle multi-byte characters split across chunks.

4. Parse Partial JSON for Structured Outputs

If the LLM is streaming structured data (JSON), the client will receive incomplete JSON strings. Use a partial JSON parser to extract available fields without waiting for the closing braces.

utils/parse-partial.ts
import { parse } from 'best-effort-json-parser';

// Inside the stream loop: accumulatedString holds every chunk received so
// far, which is usually incomplete JSON.
const partialData = parse(accumulatedString);
updateUI(partialData);
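
For example, mid-stream the accumulated string is usually an unterminated object, but a best-effort parser still recovers the fields received so far (illustrative values):

// Accumulated so far: an unterminated JSON object.
parse('{"title": "Q3 Report", "summary": "Revenue gr');
// => { title: 'Q3 Report', summary: 'Revenue gr' }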

⚠ Common Pitfalls

  • JSON.parse() will throw an error on every chunk until the stream finishes. Never use it inside the streaming loop without a try/catch or a specialized parser.

5. Optimize UI Rendering and Auto-Scroll

Frequent state updates (every token) can cause layout thrashing. Implement a scroll-to-bottom behavior that only triggers if the user is already near the bottom, allowing them to scroll up to read previous messages without being snapped back.

components/ChatWindow.tsx
useEffect(() => {
  // scrollRef points at a sentinel element rendered after the last message;
  // only follow the stream if the user is already at the bottom.
  if (isAtBottom) {
    scrollRef.current?.scrollIntoView({ behavior: 'auto' });
  }
}, [messages, isAtBottom]);
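
To address the layout-thrashing point directly, one option is to buffer incoming tokens and commit them to state at most once per animation frame; a sketch with a hypothetical enqueueToken helper:

// Collect tokens as they arrive and flush them once per frame.
let pending = '';
let frameScheduled = false;

function enqueueToken(token: string, commit: (text: string) => void) {
  pending += token;
  if (!frameScheduled) {
    frameScheduled = true;
    requestAnimationFrame(() => {
      frameScheduled = false;
      const flushed = pending;
      pending = '';
      commit(flushed);
    });
  }
}

// Usage inside the stream loop:
// enqueueToken(chunk, (text) => setCompletion((prev) => prev + text));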

⚠ Common Pitfalls

  • Using 'smooth' scroll behavior during high-velocity streaming can lead to visual lag and jitter.
  • Failing to check 'isAtBottom' snaps the view to the bottom on every update, preventing users from reading history while the LLM is still generating (see the scroll-tracking sketch below).
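
A simple way to derive 'isAtBottom' (not shown in the snippet above) is to measure the distance from the bottom on each scroll event; the 40px threshold here is arbitrary:

// Treat the user as "at the bottom" when within 40px of it.
const handleScroll = (event: React.UIEvent<HTMLDivElement>) => {
  const el = event.currentTarget;
  const distanceFromBottom = el.scrollHeight - el.scrollTop - el.clientHeight;
  setIsAtBottom(distanceFromBottom < 40);
};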

6. Handle Connection Interruptions

Streams are long-lived and prone to network drops. Implement an AbortController to allow users to stop generation and a retry mechanism for transient errors.

components/ChatInput.tsx
const abortController = new AbortController();

// Pass the signal so the request can be cancelled mid-stream.
fetch('/api/chat', { signal: abortController.signal });

const stopGeneration = () => abortController.abort();
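
To avoid the ghost-request pitfall noted below, the controller's lifetime can be tied to the component; a sketch with a hypothetical useAbortOnUnmount hook:

import { useEffect, useRef } from 'react';

// Returns a controller that is aborted automatically on unmount.
function useAbortOnUnmount() {
  const controllerRef = useRef(new AbortController());

  useEffect(() => {
    const controller = controllerRef.current;
    return () => controller.abort();
  }, []);

  return controllerRef.current;
}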

⚠ Common Pitfalls

  • Not calling .abort() when a component unmounts can lead to ghost requests and wasted API costs.
  • Automatic retries on streaming endpoints can lead to duplicate partial messages if not handled with unique message IDs.

What you built

Successfully implementing streaming requires coordination between server headers, stream-compatible edge runtimes, and resilient client-side parsing. By bypassing buffers and handling partial data correctly, you keep time-to-first-token close to the provider's own latency floor, significantly improving the user experience of AI-driven interfaces.