Building Caching Strategies with Open-Source Tools
This guide outlines a production-grade caching architecture designed to minimize latency and reduce AI API costs. It covers the implementation of multi-layer caching, including semantic caching for LLM responses and tag-based invalidation for distributed systems.
Configure Redis Eviction and Persistence Policies
Ensure Redis is configured as a cache rather than a primary database. Set the 'maxmemory-policy' to 'allkeys-lru' to automatically evict the least recently used keys when memory is full. Disable heavy RDB snapshots if data persistence is not critical for the cache layer to save on I/O overhead.
redis-cli config set maxmemory 2gb
redis-cli config set maxmemory-policy allkeys-lru
redis-cli config set save ""

⚠ Common Pitfalls
- Setting policy to 'volatile-lru' when keys lack TTLs, leading to OOM errors.
- Underestimating memory fragmentation in long-running instances.
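The fragmentation pitfall above can be checked programmatically by parsing the output of `redis-cli info memory`. A minimal sketch, assuming that text is already in hand (the `parseFragmentationRatio` helper and the sample INFO excerpt are illustrative, not part of any library):

```typescript
// Extract `mem_fragmentation_ratio` from Redis INFO output. A ratio well
// above ~1.5 suggests fragmentation; well below 1.0 suggests swapping.
function parseFragmentationRatio(infoOutput: string): number {
  const match = infoOutput.match(/mem_fragmentation_ratio:([\d.]+)/);
  if (!match) throw new Error("mem_fragmentation_ratio not found in INFO output");
  return parseFloat(match[1]);
}

// Illustrative INFO excerpt; in practice this comes from the live server.
const sampleInfo = [
  "# Memory",
  "used_memory:1073741824",
  "used_memory_rss:1932735283",
  "mem_fragmentation_ratio:1.80",
].join("\r\n");

const ratio = parseFragmentationRatio(sampleInfo);
console.log(ratio > 1.5 ? "fragmented" : "healthy");
```

Polling this ratio on long-running instances catches fragmentation drift before the OS OOM killer does.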
Implement Semantic Caching for LLM Responses
Instead of exact string matches, use vector embeddings to cache LLM responses. When a user submits a prompt, generate an embedding and query Redis (using RediSearch/RedisVL) for stored vectors within a similarity threshold (e.g., cosine similarity above 0.95, which for a cosine-distance index means a distance below 0.05).
import { RedisVectorStore } from "langchain/vectorstores/redis";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

const cache = await RedisVectorStore.fromExistingIndex(
  new OpenAIEmbeddings(),
  { redisClient: client, indexName: "llm_cache" }
);
// similaritySearchWithScore returns [document, score] pairs; for a Redis
// COSINE index the score is a distance (1 - similarity), so lower is closer.
const results = await cache.similaritySearchWithScore(userPrompt, 1);
if (results.length > 0) {
  const [doc, distance] = results[0];
  if (distance < 0.05) { // ~ cosine similarity > 0.95
    return doc.pageContent;
  }
}

⚠ Common Pitfalls
- Setting the similarity threshold too low, causing the cache to serve irrelevant answers.
- Ignoring the cost of embedding generation, which can offset cache savings for very short prompts.
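The threshold pitfall above is easiest to see with the similarity math itself. A toy sketch with hand-picked 3-dimensional vectors standing in for real embeddings (the vectors and the 0.95 threshold are illustrative):

```typescript
// Cosine similarity between two embedding vectors; used to decide
// whether a cached response is "close enough" to reuse.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" standing in for real model output.
const cachedPrompt = [0.9, 0.1, 0.0];
const nearMiss     = [0.88, 0.12, 0.02]; // paraphrase: should hit
const unrelated    = [0.0, 0.2, 0.95];   // different topic: must miss

const THRESHOLD = 0.95;
console.log(cosineSimilarity(cachedPrompt, nearMiss) > THRESHOLD);  // hit
console.log(cosineSimilarity(cachedPrompt, unrelated) > THRESHOLD); // miss
```

Lowering the threshold widens the "hit" cone: at 0.9 the cache trades a higher hit rate for a real risk of serving an answer to a different question.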
Deploy Stale-While-Revalidate (SWR) at the Edge
Configure your CDN or Edge Runtime to serve stale content while fetching updates in the background. This ensures zero-latency responses for the end user even when the cache has expired. Use the 'Cache-Control' header with 'stale-while-revalidate'.
export default async function handler(req, res) {
  res.setHeader('Cache-Control', 'public, s-maxage=60, stale-while-revalidate=300');
  const data = await fetchDataFromOrigin();
  res.status(200).json(data);
}

⚠ Common Pitfalls
- Caching PII (Personally Identifiable Information) at the edge by failing to set 'Vary: Cookie' or 'private' headers.
- Setting s-maxage too high for frequently changing data.
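The header above tells the CDN what to do; the semantics themselves can be sketched as an in-memory cache. A minimal illustration (the `SwrCache` class is hypothetical, not a real library; production deployments rely on the CDN honoring `Cache-Control` instead):

```typescript
interface Entry<T> { value: T; storedAt: number; }

// In-memory stale-while-revalidate: fresh entries are served directly,
// stale-but-usable entries are served immediately while a background
// fetch refreshes them, and expired entries block on a fresh fetch.
class SwrCache<T> {
  private store = new Map<string, Entry<T>>();
  constructor(private maxAgeMs: number, private staleMs: number) {}

  async get(key: string, fetcher: () => Promise<T>): Promise<T> {
    const entry = this.store.get(key);
    const now = Date.now();
    if (entry) {
      const age = now - entry.storedAt;
      if (age <= this.maxAgeMs) return entry.value; // fresh
      if (age <= this.maxAgeMs + this.staleMs) {
        // Stale window: serve the old value, revalidate in the background.
        void fetcher().then(v =>
          this.store.set(key, { value: v, storedAt: Date.now() }));
        return entry.value;
      }
    }
    const value = await fetcher(); // miss or past the stale window
    this.store.set(key, { value, storedAt: now });
    return value;
  }
}

// Mirrors s-maxage=60, stale-while-revalidate=300 from the header above.
const swr = new SwrCache<string>(60_000, 300_000);
const page = await swr.get("/home", async () => "rendered-home");
console.log(page);
```

The key property: within the stale window the user never waits on the origin, which is exactly what the `s-maxage`/`stale-while-revalidate` pair buys at the edge.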
Implement Tag-Based Cache Invalidation
Avoid relying solely on TTLs. Implement a tagging system where cache entries are grouped by entity (e.g., 'user:123', 'post:456'). When an entity is updated, purge all cache keys associated with that tag across your distributed layers.
async function updatePost(postId, newData) {
  await db.posts.update(postId, newData);
  // Purge application cache
  await redis.del(`post_data:${postId}`);
  // Purge CDN cache via API (purge-by-tag requires a Cloudflare Enterprise plan)
  await fetch(`https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_TOKEN}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ tags: [`post-${postId}`] })
  });
}

⚠ Common Pitfalls
- Creating a 'thundering herd' problem where all edge nodes hit the origin simultaneously after a purge.
- Inconsistent state if the database update succeeds but the cache purge fails.
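The tag index itself can be sketched in memory to show the bookkeeping: each key registers under one or more tags, and purging a tag drops every associated key. A minimal illustration (the `TaggedCache` class is hypothetical; with Redis the same idea is usually a SET per tag, populated with SADD on write and drained with SMEMBERS + DEL on invalidation):

```typescript
// Tag-based invalidation: values keyed normally, plus a reverse index
// from tag -> set of keys so one update can purge every dependent entry.
class TaggedCache {
  private values = new Map<string, string>();
  private tagIndex = new Map<string, Set<string>>();

  set(key: string, value: string, tags: string[]): void {
    this.values.set(key, value);
    for (const tag of tags) {
      if (!this.tagIndex.has(tag)) this.tagIndex.set(tag, new Set());
      this.tagIndex.get(tag)!.add(key);
    }
  }

  get(key: string): string | undefined {
    return this.values.get(key);
  }

  // Drop every key registered under the tag; returns how many were purged.
  invalidateTag(tag: string): number {
    const keys = this.tagIndex.get(tag) ?? new Set<string>();
    for (const key of keys) this.values.delete(key);
    this.tagIndex.delete(tag);
    return keys.size;
  }
}

const tagged = new TaggedCache();
tagged.set("post_data:456", "{...}", ["post:456", "user:123"]);
tagged.set("feed:user:123", "[...]", ["user:123"]);
const purged = tagged.invalidateTag("user:123"); // drops both entries
console.log(purged);
```

One update to user 123 purges the post and the feed in a single pass, with no TTL guessing.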
Instrument Cache Hit Rate Monitoring
Track the effectiveness of your caching strategy. Export metrics to a dashboard (Grafana/Datadog) to visualize 'Hit', 'Miss', and 'Revalidate' events. Aim for a hit rate above 80% for static assets and 30-50% for LLM responses.
const startTime = Date.now();
const cachedValue = await redis.get(key);
if (cachedValue) {
  metrics.increment('cache.hit', { layer: 'redis' });
  metrics.timing('cache.lookup_ms', Date.now() - startTime);
  return JSON.parse(cachedValue);
}
metrics.increment('cache.miss', { layer: 'redis' });
const freshData = await fetchFromOrigin();
await redis.setex(key, 3600, JSON.stringify(freshData));
return freshData;

⚠ Common Pitfalls
- Monitoring only the global hit rate instead of per-route or per-client hit rates.
- Overlooking 'Cache-Control: no-cache' headers sent by clients which bypass the cache.
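The counters above feed a derived hit-rate gauge. A small sketch of the aggregation side, assuming a generic per-layer counter (the `CacheMetrics` class is illustrative, standing in for whatever your Grafana/Datadog exporter provides):

```typescript
// Per-layer hit/miss counters and the derived hit-rate percentage that
// would be exported as a dashboard gauge.
class CacheMetrics {
  private counts = new Map<string, { hits: number; misses: number }>();

  record(layer: string, hit: boolean): void {
    const c = this.counts.get(layer) ?? { hits: 0, misses: 0 };
    if (hit) c.hits++; else c.misses++;
    this.counts.set(layer, c);
  }

  hitRate(layer: string): number {
    const c = this.counts.get(layer);
    if (!c || c.hits + c.misses === 0) return 0;
    return c.hits / (c.hits + c.misses);
  }
}

const cacheMetrics = new CacheMetrics();
for (let i = 0; i < 8; i++) cacheMetrics.record("redis", true);
for (let i = 0; i < 2; i++) cacheMetrics.record("redis", false);
console.log(cacheMetrics.hitRate("redis")); // 0.8 — at the 80% target for static assets
```

Keeping the map keyed by layer (or by route) is what makes the per-route breakdown from the first pitfall possible: a healthy global average can hide a 5% hit rate on your most expensive endpoint.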
What you built
Successful caching requires a multi-layered approach that combines deterministic key-value lookups with modern semantic search for AI. By implementing SWR at the edge and robust tag-based invalidation at the backend, you can achieve sub-100ms response times and significantly reduce operational costs.