Building production FastAPI services with open-source tools
This guide outlines the implementation of a production-ready FastAPI architecture designed for high-concurrency LLM streaming and asynchronous database management. It focuses on solving common bottlenecks in project structure, connection pooling, and non-blocking I/O.
Establish a Scalable Directory Structure
Move beyond single-file applications by separating concerns into discrete modules. Use an 'app' directory containing 'api' (routes), 'core' (config/security), 'models' (SQLAlchemy), 'schemas' (Pydantic), and 'services' (business logic). This prevents circular imports and simplifies testing; a minimal sketch of the route/service split follows the pitfalls below.
project_root/
├── app/
│ ├── api/ # Route handlers
│ ├── core/ # Config, security, constants
│ ├── models/ # Database models
│ ├── schemas/ # Pydantic models
│ ├── services/ # Business logic/LLM clients
│ └── main.py # Application entry point
├── alembic/ # Database migrations
└── docker-compose.yml

⚠ Common Pitfalls
- Placing business logic directly inside route handlers, making unit testing difficult.
- Circular dependencies between models and schemas.
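To see the separation in practice, here is a minimal sketch of a route that stays a thin wrapper around a service function. The 'summarize' service and both module paths are hypothetical, chosen only to illustrate the layout above:

# app/services/summary.py: business logic, unit-testable without HTTP
async def summarize(text: str) -> str:
    # ...call the LLM client, post-process, etc.
    return text[:100]

# app/api/summaries.py: the route only validates input and delegates
from fastapi import APIRouter
from pydantic import BaseModel

from app.services.summary import summarize

router = APIRouter()

class SummaryRequest(BaseModel):
    text: str

@router.post("/summaries")
async def create_summary(payload: SummaryRequest) -> dict:
    return {"summary": await summarize(payload.text)}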
Configure Asynchronous SQLAlchemy and Connection Pooling
Use 'sqlalchemy.ext.asyncio' with 'asyncpg' to handle database operations without blocking the event loop. Configure the engine with 'pool_size' and 'max_overflow' based on your expected concurrent user load.
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

DATABASE_URL = "postgresql+asyncpg://user:pass@localhost/db"

engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,        # persistent connections kept open per process
    max_overflow=10,     # extra connections allowed under burst load
    pool_pre_ping=True,  # validate a connection before handing it out
)

# Note: 'autocommit' was removed in SQLAlchemy 2.0 (which async_sessionmaker
# requires); sessions always use explicit transactions.
AsyncSessionLocal = async_sessionmaker(
    bind=engine,
    autoflush=False,
    expire_on_commit=False,  # keep ORM objects usable after commit
)

⚠ Common Pitfalls
- Using a synchronous driver like 'psycopg2', which blocks the FastAPI event loop.
- Setting pool_size too high: each worker process opens its own pool, so the total can exceed PostgreSQL's 'max_connections' limit (see the sizing check below).
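Before deploying, sanity-check the math: peak usage is workers x (pool_size + max_overflow). A quick check using the engine settings above and the four Gunicorn workers from the deployment section later in this guide:

workers = 4            # Gunicorn worker processes (see deployment section)
pool_size = 20
max_overflow = 10

peak = workers * (pool_size + max_overflow)
print(peak)            # 120: above PostgreSQL's default max_connections of 100

With these numbers you would need to raise 'max_connections' or shrink the per-worker pool.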
Implement Dependency Injection for Database Sessions
Create a dependency that yields a session to ensure every request gets its own connection and that the connection is properly closed after the request finishes, even if an exception occurs.
from typing import AsyncGenerator

from sqlalchemy.ext.asyncio import AsyncSession

from app.core.db import AsyncSessionLocal

async def get_db() -> AsyncGenerator[AsyncSession, None]:
    # 'async with' guarantees the session is closed even if the handler raises
    async with AsyncSessionLocal() as session:
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise

⚠ Common Pitfalls
- Creating sessions outside an 'async with' block and forgetting to 'await session.close()', leading to connection leaks.
- Sharing a single session object across multiple concurrent requests.
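With the dependency in place, handlers request a session via 'Depends'. A minimal sketch of a consuming route; the '/health/db' path and the 'app.api.deps' module are illustrative:

from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

from app.api.deps import get_db  # assumed location of the dependency

router = APIRouter()

@router.get("/health/db")
async def db_health(db: AsyncSession = Depends(get_db)):
    # FastAPI opens a fresh session for this request and closes it afterwards
    result = await db.execute(text("SELECT 1"))
    return {"db_ok": result.scalar_one() == 1}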
Create a Streaming LLM Response Endpoint
Use FastAPI's 'StreamingResponse' with an async generator to pipe tokens from an LLM provider (like OpenAI or a local vLLM instance) to the client. This reduces perceived latency by providing data as it is generated.
import asyncio

from fastapi import APIRouter
from fastapi.responses import StreamingResponse

router = APIRouter()

async def mock_llm_generator():
    # Stand-in for a real provider stream (OpenAI, vLLM, etc.)
    tokens = ["FastAPI", " is", " extremely", " fast."]
    for token in tokens:
        yield f"data: {token}\n\n"  # Server-Sent Events framing
        await asyncio.sleep(0.1)    # simulate per-token latency

@router.get("/stream")
async def stream_llm():
    return StreamingResponse(mock_llm_generator(), media_type="text/event-stream")

⚠ Common Pitfalls
- Not setting the correct 'media_type', which can cause browsers to buffer the response.
- Using a synchronous generator that blocks the worker while waiting for the next token.
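To check the stream end to end, you can consume it with an async HTTP client. A sketch using httpx (not a dependency of anything above, just one convenient way to read the body incrementally), assuming the server is running on localhost:8000:

import asyncio
import httpx

async def consume_stream():
    async with httpx.AsyncClient() as client:
        # stream() reads the body incrementally instead of buffering it all
        async with client.stream("GET", "http://localhost:8000/stream") as response:
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    print(line.removeprefix("data: "), end="", flush=True)

asyncio.run(consume_stream())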
Optimize Deployment with Gunicorn and Uvicorn Workers
For production, run FastAPI behind Gunicorn using the Uvicorn worker class. This provides process management (Gunicorn) while retaining async capabilities (Uvicorn). Set the number of workers to (2 x $num_cores) + 1.
CMD ["gunicorn", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "app.main:app", "--bind", "0.0.0.0:8000"]

⚠ Common Pitfalls
- Running with '--reload' in production, which significantly degrades performance.
- Using an insufficient number of workers for CPU-bound tasks like Pydantic validation on large payloads.
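Rather than hard-coding '-w 4', you can compute the worker count from the rule above in a Gunicorn config file. A minimal sketch; the filename 'gunicorn_conf.py' is an assumption, pass whatever you name it via '-c':

# gunicorn_conf.py
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1  # (2 x cores) + 1, per the rule above
worker_class = "uvicorn.workers.UvicornWorker"
bind = "0.0.0.0:8000"

The Dockerfile CMD then becomes: CMD ["gunicorn", "-c", "gunicorn_conf.py", "app.main:app"].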
What you built
Following this architecture ensures that your FastAPI application remains maintainable as it grows and performs efficiently under load. By utilizing async database sessions and streaming responses, you maximize the throughput of your hardware while providing a responsive experience for AI-driven features.