Building an AI API Cost-Optimization Layer with Open-Source Tools
Transitioning an AI application from prototype to production often results in exponential cost increases. This guide provides a technical roadmap for implementing a cost-optimization layer using model routing, semantic caching, and batch processing to reduce LLM spend by up to 80% without sacrificing output quality.
Centralize API Calls via a Unified Proxy
Replace direct SDK calls with a proxy layer like LiteLLM. This allows you to swap models, track costs globally, and implement fallbacks without changing business logic in multiple files.
from litellm import completion
def get_response(model_alias, messages):
    # Map aliases to specific provider models in a central config
    model_map = {
        "cheap": "gpt-4o-mini",
        "quality": "claude-3-5-sonnet-20240620"
    }
    return completion(model=model_map[model_alias], messages=messages)

⚠ Common Pitfalls
- Adding a proxy can introduce minor latency; ensure the proxy is deployed in the same region as your application.
- Hardcoding model strings inside components instead of using the central map.
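The intro above also mentions fallbacks. As a hedged sketch (exact Router options may vary across LiteLLM versions), the Router lets you declare deployments and fallback chains in one place, so an outage or rate limit on the cheap tier degrades gracefully without touching call sites:

from litellm import Router

router = Router(
    model_list=[
        {"model_name": "cheap", "litellm_params": {"model": "gpt-4o-mini"}},
        {"model_name": "quality", "litellm_params": {"model": "claude-3-5-sonnet-20240620"}},
    ],
    # If a "cheap" call fails, retry it against the "quality" deployment
    fallbacks=[{"cheap": ["quality"]}],
)

response = router.completion(model="cheap", messages=[{"role": "user", "content": "Hello"}])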
Implement Semantic Caching with Redis
Standard exact-match caching is rarely effective for LLMs because users phrase the same request in many different ways. Use a vector-based semantic cache to return stored results for prompts that are semantically equivalent but worded differently (e.g., 'How's the weather?' vs. 'What is the weather like?').
from langchain.cache import RedisSemanticCache
from langchain.embeddings import OpenAIEmbeddings
import langchain
# Tune score_threshold to control how close a prompt must be to count as a cache hit
langchain.llm_cache = RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.05
)

⚠ Common Pitfalls
- Setting the threshold too loosely can cause the cache to return a previous answer for a genuinely different question.
- Caching sensitive user data across different user sessions.
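Once the cache is registered globally, LangChain model calls go through it transparently. A minimal usage sketch (import paths differ across LangChain versions; this follows the legacy style used above):

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4o-mini")
llm.invoke("How's the weather?")         # Cache miss: calls the API and stores the embedding and answer
llm.invoke("What is the weather like?")  # Likely served from the semantic cache if within the threshold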
Deploy Tiered Model Routing Logic
Categorize tasks by complexity. Use small models (GPT-4o mini, Gemini Flash) for classification, summarization, and extraction. Reserve flagship models (GPT-4o, Claude Opus) only for complex reasoning or final creative output.
from litellm import completion

def route_request(task_type, messages):
    # Send simple classification tasks to the small model; escalate everything else
    model = "gpt-4o-mini" if task_type == "classification" else "gpt-4o"
    return completion(model=model, messages=messages)

⚠ Common Pitfalls
- Underestimating the reasoning requirements of a 'simple' task, leading to downstream errors.
- Failure to monitor the error rates of the cheaper model tier (a logging sketch follows below).
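To catch regressions on the cheap tier, wrap the router with per-tier logging so error rates can be compared per model. A minimal sketch building on the route_request helper above:

import logging
from litellm import completion

logger = logging.getLogger("model_router")

def route_request_monitored(task_type, messages):
    model = "gpt-4o-mini" if task_type == "classification" else "gpt-4o"
    try:
        response = completion(model=model, messages=messages)
        logger.info("llm_call model=%s task=%s status=ok", model, task_type)
        return response
    except Exception:
        logger.exception("llm_call model=%s task=%s status=error", model, task_type)
        raise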
Utilize Batch API for Asynchronous Tasks
For non-real-time tasks like bulk data processing or nightly content generation, use the OpenAI or Anthropic Batch APIs. These offer a 50% discount relative to standard request-response pricing in exchange for asynchronous processing, with results returned within a 24-hour window.
from openai import OpenAI

client = OpenAI()

# Create a batch job for offline processing; the input file (a JSONL of requests)
# must already be uploaded via the Files API
client.batches.create(
    input_file_id="file-xyz123",
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

⚠ Common Pitfalls
- Using the Batch API for interactive features where users expect immediate feedback.
- Forgetting to implement a webhook or polling mechanism to retrieve results once the batch is complete (a polling sketch follows below).
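Batch results are not pushed to you automatically; you have to poll (or trigger your own notification) and then download the output file. A minimal polling sketch using the OpenAI SDK, where batch_id comes from the create call above:

import time
from openai import OpenAI

client = OpenAI()

def wait_for_batch(batch_id, poll_seconds=300):
    # Check the batch status periodically, then download the JSONL results on completion
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            return client.files.content(batch.output_file_id).text
        if batch.status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"Batch ended with status: {batch.status}")
        time.sleep(poll_seconds)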
Enforce Token Budgets and Truncation
Set hard limits on `max_tokens` and implement aggressive context window management. Use tiktoken to count tokens locally before sending the request to prevent over-spending on massive input contexts.
import tiktoken

def truncate_messages(messages, model="gpt-4o", limit=4000):
    enc = tiktoken.encoding_for_model(model)
    truncated = list(messages)
    # Drop the oldest non-system message until the total fits within the token budget
    while sum(len(enc.encode(m["content"])) for m in truncated) > limit and len(truncated) > 1:
        truncated.pop(1 if truncated[0]["role"] == "system" else 0)
    return truncated

⚠ Common Pitfalls
- Truncating the system prompt or the most recent user instruction, which breaks the model's ability to follow directions.
- Inaccurate token counting for non-OpenAI models if using an OpenAI-specific tokenizer.
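Truncation handles the input side; pair it with a hard cap on output length per request. A short sketch combining the truncate_messages helper above with LiteLLM (conversation_history is a placeholder for your stored message list):

from litellm import completion

# conversation_history: your stored list of {"role": ..., "content": ...} message dicts
trimmed = truncate_messages(conversation_history, limit=4000)
response = completion(model="gpt-4o", messages=trimmed, max_tokens=512)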
Implement Per-User and Per-Feature Cost Tracking
Inject custom headers or metadata (like 'user_id' or 'feature_id') into your API calls. Use a tool like Helicone or LangSmith to visualize which users or features consume the most budget, and use that data to inform pricing and product decisions.
from litellm import completion

response = completion(
    model="gpt-4o-mini",
    messages=messages,
    metadata={
        "user_id": "user_123",
        "feature": "search_summarization"
    }
)

⚠ Common Pitfalls
- Failing to reconcile internal tracking with the provider's monthly invoice.
- Privacy concerns: ensure no PII is sent in metadata fields to third-party monitoring tools.
What you built
By moving from direct API calls to a structured middleware approach, you gain the visibility needed to control costs. Start with model routing for the quickest wins, then implement semantic caching and batching as your traffic volume justifies the architectural complexity.