Building production FastAPI services with open-source tools
This guide outlines the implementation of a production-ready FastAPI architecture designed for high-concurrency LLM streaming and asynchronous database management. It focuses on solving common bottlenecks in project structure, connection pooling, and non-blocking I/O.
Establish a Scalable Directory Structure
Move beyond single-file applications by separating concerns into discrete modules. Use an 'app' directory containing 'api' (routes), 'core' (config/security), 'models' (SQLAlchemy), 'schemas' (Pydantic), and 'services' (business logic). This prevents circular imports and simplifies testing; a minimal sketch of the route/service split follows the pitfalls below.
project_root/
├── app/
│ ├── api/ # Route handlers
│ ├── core/ # Config, security, constants
│ ├── models/ # Database models
│ ├── schemas/ # Pydantic models
│ ├── services/ # Business logic/LLM clients
│ └── main.py # Application entry point
├── alembic/ # Database migrations
└── docker-compose.yml

⚠ Common Pitfalls
- Placing business logic directly inside route handlers, making unit testing difficult.
- Circular dependencies between models and schemas.
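To see the separation in practice, here is a minimal sketch of a route that stays a thin wrapper around a service function. The 'summarize' service and both module paths are hypothetical, chosen only to illustrate the layout above:

# app/services/summary.py: business logic, unit-testable without HTTP
async def summarize(text: str) -> str:
    # ...call the LLM client, post-process, etc.
    return text[:100]

# app/api/summaries.py: the route only validates input and delegates
from fastapi import APIRouter
from pydantic import BaseModel

from app.services.summary import summarize

router = APIRouter()

class SummaryRequest(BaseModel):
    text: str

@router.post("/summaries")
async def create_summary(payload: SummaryRequest) -> dict:
    return {"summary": await summarize(payload.text)}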
Configure Asynchronous SQLAlchemy and Connection Pooling
Use 'sqlalchemy.ext.asyncio' with 'asyncpg' to handle database operations without blocking the event loop. Configure the engine with 'pool_size' and 'max_overflow' based on your expected concurrent user load.
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

DATABASE_URL = "postgresql+asyncpg://user:pass@localhost/db"

engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,        # persistent connections kept open per process
    max_overflow=10,     # extra connections allowed under burst load
    pool_pre_ping=True,  # validate a connection before handing it out
)

# Note: 'autocommit' was removed in SQLAlchemy 2.0 (which async_sessionmaker
# requires); sessions always use explicit transactions.
AsyncSessionLocal = async_sessionmaker(
    bind=engine,
    autoflush=False,
    expire_on_commit=False,  # keep ORM objects usable after commit
)

⚠ Common Pitfalls
- Using a synchronous driver like 'psycopg2', which blocks the FastAPI event loop.
- Setting pool_size too high: each worker process opens its own pool, so the total can exceed PostgreSQL's 'max_connections' limit (see the sizing check below).
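Before deploying, sanity-check the math: peak usage is workers x (pool_size + max_overflow). A quick check using the engine settings above and the four Gunicorn workers from the deployment section later in this guide:

workers = 4            # Gunicorn worker processes (see deployment section)
pool_size = 20
max_overflow = 10

peak = workers * (pool_size + max_overflow)
print(peak)            # 120: above PostgreSQL's default max_connections of 100

With these numbers you would need to raise 'max_connections' or shrink the per-worker pool.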
Implement Dependency Injection for Database Sessions
Create a dependency that yields a session to ensure every request gets its own connection and that the connection is properly closed after the request finishes, even if an exception occurs.
from typing import AsyncGenerator

from sqlalchemy.ext.asyncio import AsyncSession

from app.core.db import AsyncSessionLocal

async def get_db() -> AsyncGenerator[AsyncSession, None]:
    # 'async with' guarantees the session is closed even if the handler raises
    async with AsyncSessionLocal() as session:
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise

⚠ Common Pitfalls
- Creating sessions outside an 'async with' block and forgetting to 'await session.close()', leading to connection leaks.
- Sharing a single session object across multiple concurrent requests.
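With the dependency in place, handlers request a session via 'Depends'. A minimal sketch of a consuming route; the '/health/db' path and the 'app.api.deps' module are illustrative:

from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

from app.api.deps import get_db  # assumed location of the dependency

router = APIRouter()

@router.get("/health/db")
async def db_health(db: AsyncSession = Depends(get_db)):
    # FastAPI opens a fresh session for this request and closes it afterwards
    result = await db.execute(text("SELECT 1"))
    return {"db_ok": result.scalar_one() == 1}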
Create a Streaming LLM Response Endpoint
Use FastAPI's 'StreamingResponse' with an async generator to pipe tokens from an LLM provider (like OpenAI or a local vLLM instance) to the client. This reduces perceived latency by providing data as it is generated.
import asyncio

from fastapi import APIRouter
from fastapi.responses import StreamingResponse

router = APIRouter()

async def mock_llm_generator():
    # Stand-in for a real provider stream (OpenAI, vLLM, etc.)
    tokens = ["FastAPI", " is", " extremely", " fast."]
    for token in tokens:
        yield f"data: {token}\n\n"  # Server-Sent Events framing
        await asyncio.sleep(0.1)    # simulate per-token latency

@router.get("/stream")
async def stream_llm():
    return StreamingResponse(mock_llm_generator(), media_type="text/event-stream")

⚠ Common Pitfalls
- Not setting the correct 'media_type', which can cause browsers to buffer the response.
- Using a synchronous generator that blocks the worker while waiting for the next token.
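To check the stream end to end, you can consume it with an async HTTP client. A sketch using httpx (not a dependency of anything above, just one convenient way to read the body incrementally), assuming the server is running on localhost:8000:

import asyncio
import httpx

async def consume_stream():
    async with httpx.AsyncClient() as client:
        # stream() reads the body incrementally instead of buffering it all
        async with client.stream("GET", "http://localhost:8000/stream") as response:
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    print(line.removeprefix("data: "), end="", flush=True)

asyncio.run(consume_stream())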
Optimize Deployment with Gunicorn and Uvicorn Workers
For production, run FastAPI behind Gunicorn using the Uvicorn worker class. This provides process management (Gunicorn) while retaining async capabilities (Uvicorn). Set the number of workers to (2 x $num_cores) + 1.
CMD ["gunicorn", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "app.main:app", "--bind", "0.0.0.0:8000"]

⚠ Common Pitfalls
- Running with '--reload' in production, which significantly degrades performance.
- Using an insufficient number of workers for CPU-bound tasks like Pydantic validation on large payloads.
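Rather than hard-coding '-w 4', you can compute the worker count from the rule above in a Gunicorn config file. A minimal sketch; the filename 'gunicorn_conf.py' is an assumption, pass whatever you name it via '-c':

# gunicorn_conf.py
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1  # (2 x cores) + 1, per the rule above
worker_class = "uvicorn.workers.UvicornWorker"
bind = "0.0.0.0:8000"

The Dockerfile CMD then becomes: CMD ["gunicorn", "-c", "gunicorn_conf.py", "app.main:app"].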
What you built
Following this architecture ensures that your FastAPI application remains maintainable as it grows and performs efficiently under load. By utilizing async database sessions and streaming responses, you maximize the throughput of your hardware while providing a responsive experience for AI-driven features.