Guides

Building Caching Strategies with Open-Source Tools

This guide outlines a production-grade caching architecture designed to minimize latency and reduce AI API costs. It covers the implementation of multi-layer caching, including semantic caching for LLM responses and tag-based invalidation for distributed systems.

4 hours · 5 steps
1

Configure Redis Eviction and Persistence Policies

Ensure Redis is configured as a cache rather than a primary database. Set the 'maxmemory-policy' to 'allkeys-lru' to automatically evict the least recently used keys when memory is full. Disable heavy RDB snapshots if data persistence is not critical for the cache layer to save on I/O overhead.

redis-cli
redis-cli config set maxmemory 2gb
redis-cli config set maxmemory-policy allkeys-lru
redis-cli config set save ""

⚠ Common Pitfalls

  • Setting policy to 'volatile-lru' when keys lack TTLs, leading to OOM errors.
  • Underestimating memory fragmentation in long-running instances.
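The fragmentation pitfall above can be watched programmatically. A minimal sketch, assuming the standard `INFO memory` reply format; the helper name and the `client` usage are illustrative, not a library API:

```javascript
// Extract mem_fragmentation_ratio from a raw `INFO memory` reply.
// A ratio well above 1.0 means the allocator holds more physical
// memory than Redis is logically using.
function fragmentationRatio(infoText) {
  const match = infoText.match(/mem_fragmentation_ratio:([\d.]+)/);
  return match ? parseFloat(match[1]) : null;
}

// Hypothetical usage with a connected node-redis client:
// const ratio = fragmentationRatio(await client.info('memory'));
// if (ratio > 1.5) alertOps(`Redis fragmentation high: ${ratio}`);
```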
2

Implement Semantic Caching for LLM Responses

Instead of exact string matches, use vector embeddings to cache LLM responses. When a user submits a prompt, generate an embedding and query Redis (using RediSearch/RedisVL) for existing vectors within a specific distance threshold (e.g., cosine similarity > 0.95).

semantic-cache.js
import { createClient } from "redis";
import { RedisVectorStore } from "langchain/vectorstores/redis";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

const client = createClient({ url: process.env.REDIS_URL });
await client.connect();

const cache = await RedisVectorStore.fromExistingIndex(
  new OpenAIEmbeddings(),
  { redisClient: client, indexName: "llm_cache" }
);

// similaritySearchWithScore returns [document, distance] pairs;
// for cosine distance, lower means more similar. The 0.05 threshold
// is illustrative and should be tuned per workload.
const results = await cache.similaritySearchWithScore(userPrompt, 1);
if (results.length > 0 && results[0][1] < 0.05) {
  return results[0][0].pageContent;
}

⚠ Common Pitfalls

  • Setting the similarity threshold too low, causing the cache to serve irrelevant answers.
  • Ignoring the cost of embedding generation, which can offset cache savings for very short prompts.
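To make the threshold concrete, this is the similarity math the vector index computes under the hood. A minimal plain-JavaScript sketch with no Redis dependency; the `cosineSimilarity` helper name is illustrative:

```javascript
// Cosine similarity between two equal-length embedding vectors:
// dot(a, b) / (|a| * |b|). Returns 1 for identical directions,
// 0 for orthogonal vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A cached answer is reusable only when similarity clears the threshold:
const isHit = cosineSimilarity([1, 0, 0], [1, 0, 0]) > 0.95;
```

Note that vector databases often report *distance* rather than similarity, so check which convention your index uses before picking a threshold.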
3

Deploy Stale-While-Revalidate (SWR) at the Edge

Configure your CDN or Edge Runtime to serve stale content while fetching updates in the background. This ensures zero-latency responses for the end user even when the cache has expired. Use the 'Cache-Control' header with 'stale-while-revalidate'.

edge-handler.js
export default async function handler(req, res) {
  // Serve from the shared cache for 60s, then serve stale for up to
  // 300s more while the CDN revalidates in the background.
  res.setHeader('Cache-Control', 'public, s-maxage=60, stale-while-revalidate=300');
  const data = await fetchDataFromOrigin();
  res.status(200).json(data);
}

⚠ Common Pitfalls

  • Caching PII (Personally Identifiable Information) at the edge by failing to set 'Vary: Cookie' or 'private' headers.
  • Setting s-maxage too high for frequently changing data.
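The header above encodes three distinct states for a cached response. A small sketch of that decision logic, following RFC 5861 semantics (function and state names are illustrative):

```javascript
// Classify a cached response by its age:
// - fresh: within s-maxage, serve directly
// - stale-revalidate: expired but inside the SWR window, serve stale
//   while refetching in the background
// - miss: too old, the client must wait for the origin
function classifyAge(ageSeconds, sMaxage, swrWindow) {
  if (ageSeconds <= sMaxage) return "fresh";
  if (ageSeconds <= sMaxage + swrWindow) return "stale-revalidate";
  return "miss";
}
```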
4

Implement Tag-Based Cache Invalidation

Avoid relying solely on TTLs. Implement a tagging system where cache entries are grouped by entity (e.g., 'user:123', 'post:456'). When an entity is updated, purge all cache keys associated with that tag across your distributed layers.

invalidation.js
async function updatePost(postId, newData) {
  await db.posts.update(postId, newData);
  // Purge application cache
  await redis.del(`post_data:${postId}`);
  // Purge CDN cache by tag via the Cloudflare API
  // (requires an API token with cache purge permission)
  await fetch(`https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ tags: [`post-${postId}`] })
  });
}

⚠ Common Pitfalls

  • Creating a 'thundering herd' problem where all edge nodes hit the origin simultaneously after a purge.
  • Inconsistent state if the database update succeeds but the cache purge fails.
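One way to blunt the thundering-herd pitfall above is to coalesce concurrent misses so that only one origin request is in flight per key. A minimal in-process sketch; the `coalescedFetch` name is an assumption, not a library API:

```javascript
// Concurrent cache misses for the same key share a single in-flight
// origin fetch instead of each hitting the origin after a purge.
const inFlight = new Map();

async function coalescedFetch(key, fetchOrigin) {
  if (inFlight.has(key)) return inFlight.get(key); // join the existing fetch
  const promise = fetchOrigin(key).finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}
```

This only dedupes within a single process; across a fleet you would typically take a short-lived distributed lock (e.g. Redis `SET key value NX PX 5000`) so one node refills the cache while the others wait or serve stale.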
5

Instrument Cache Hit Rate Monitoring

Track the effectiveness of your caching strategy. Export metrics to a dashboard (Grafana/Datadog) to visualize 'Hit', 'Miss', and 'Revalidate' events. Aim for a hit rate above 80% for static assets and 30-50% for LLM responses.

telemetry.js
const startTime = Date.now();
const cachedValue = await redis.get(key);
// Record lookup latency (assumes a StatsD-style timing helper)
metrics.timing('cache.lookup_ms', Date.now() - startTime, { layer: 'redis' });

if (cachedValue) {
  metrics.increment('cache.hit', { layer: 'redis' });
  return JSON.parse(cachedValue);
}

metrics.increment('cache.miss', { layer: 'redis' });
const freshData = await fetchFromOrigin();
await redis.setex(key, 3600, JSON.stringify(freshData));

⚠ Common Pitfalls

  • Monitoring only the global hit rate instead of per-route or per-client hit rates.
  • Overlooking 'Cache-Control: no-cache' headers sent by clients, which bypass the cache.
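The per-route pitfall above can be sketched as a small reducer over raw hit/miss events. This is an illustrative shape only; real pipelines would usually aggregate in the metrics backend:

```javascript
// Roll raw cache events up into a per-route hit rate, so a cold route
// cannot hide behind a healthy global average.
function hitRates(events) {
  const byRoute = {};
  for (const { route, type } of events) {
    byRoute[route] ??= { hit: 0, miss: 0 };
    byRoute[route][type] += 1;
  }
  return Object.fromEntries(
    Object.entries(byRoute).map(([route, { hit, miss }]) => [route, hit / (hit + miss)])
  );
}
```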

What you built

Successful caching requires a multi-layered approach that combines deterministic key-value lookups with modern semantic search for AI. By implementing SWR at the edge and robust tag-based invalidation at the backend, you can achieve sub-100ms response times and significantly reduce operational costs.