Guides

Building Caching Strategies with Open-Source Tools

This guide outlines a production-grade caching architecture designed to minimize latency and reduce AI API costs. It covers the implementation of multi-layer caching, including semantic caching for LLM responses and tag-based invalidation for distributed systems.

4 hours · 5 steps
1

Configure Redis Eviction and Persistence Policies

Ensure Redis is configured as a cache rather than a primary database. Set the 'maxmemory-policy' to 'allkeys-lru' to automatically evict the least recently used keys when memory is full. Disable heavy RDB snapshots if data persistence is not critical for the cache layer to save on I/O overhead.

redis-cli
redis-cli config set maxmemory 2gb
redis-cli config set maxmemory-policy allkeys-lru
redis-cli config set save ""

⚠ Common Pitfalls

  • Setting policy to 'volatile-lru' when keys lack TTLs, leading to OOM errors.
  • Underestimating memory fragmentation in long-running instances.
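The fragmentation pitfall above can be watched programmatically. A minimal sketch, assuming the standard `INFO memory` reply format; the helper name and the `client` usage are illustrative, not a library API:

```javascript
// Extract mem_fragmentation_ratio from a raw `INFO memory` reply.
// A ratio well above 1.0 means the allocator holds more physical
// memory than Redis is logically using.
function fragmentationRatio(infoText) {
  const match = infoText.match(/mem_fragmentation_ratio:([\d.]+)/);
  return match ? parseFloat(match[1]) : null;
}

// Hypothetical usage with a connected node-redis client:
// const ratio = fragmentationRatio(await client.info('memory'));
// if (ratio > 1.5) alertOps(`Redis fragmentation high: ${ratio}`);
```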
2

Implement Semantic Caching for LLM Responses

Instead of exact string matches, use vector embeddings to cache LLM responses. When a user submits a prompt, generate an embedding and query Redis (using RediSearch/RedisVL) for existing vectors within a specific distance threshold (e.g., cosine similarity > 0.95).

semantic-cache.js
import { createClient } from "redis";
import { RedisVectorStore } from "langchain/vectorstores/redis";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

const client = createClient({ url: process.env.REDIS_URL });
await client.connect();

const cache = await RedisVectorStore.fromExistingIndex(
  new OpenAIEmbeddings(),
  { redisClient: client, indexName: "llm_cache" }
);

// similaritySearchWithScore returns [document, distance] pairs;
// for cosine distance, lower means more similar. The 0.05 threshold
// is illustrative and should be tuned per workload.
const results = await cache.similaritySearchWithScore(userPrompt, 1);
if (results.length > 0 && results[0][1] < 0.05) {
  return results[0][0].pageContent;
}

⚠ Common Pitfalls

  • Setting the similarity threshold too low, causing the cache to serve irrelevant answers.
  • Ignoring the cost of embedding generation, which can offset cache savings for very short prompts.
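To make the threshold concrete, this is the similarity math the vector index computes under the hood. A minimal plain-JavaScript sketch with no Redis dependency; the `cosineSimilarity` helper name is illustrative:

```javascript
// Cosine similarity between two equal-length embedding vectors:
// dot(a, b) / (|a| * |b|). Returns 1 for identical directions,
// 0 for orthogonal vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A cached answer is reusable only when similarity clears the threshold:
const isHit = cosineSimilarity([1, 0, 0], [1, 0, 0]) > 0.95;
```

Note that vector databases often report *distance* rather than similarity, so check which convention your index uses before picking a threshold.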
3

Deploy Stale-While-Revalidate (SWR) at the Edge

Configure your CDN or Edge Runtime to serve stale content while fetching updates in the background. This ensures zero-latency responses for the end user even when the cache has expired. Use the 'Cache-Control' header with 'stale-while-revalidate'.

edge-handler.js
export default async function handler(req, res) {
  // Serve from the shared cache for 60s, then serve stale for up to
  // 300s more while the CDN revalidates in the background.
  res.setHeader('Cache-Control', 'public, s-maxage=60, stale-while-revalidate=300');
  const data = await fetchDataFromOrigin();
  res.status(200).json(data);
}

⚠ Common Pitfalls

  • Caching PII (Personally Identifiable Information) at the edge by failing to set 'Vary: Cookie' or 'private' headers.
  • Setting s-maxage too high for frequently changing data.
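The header above encodes three distinct states for a cached response. A small sketch of that decision logic, following RFC 5861 semantics (function and state names are illustrative):

```javascript
// Classify a cached response by its age:
// - fresh: within s-maxage, serve directly
// - stale-revalidate: expired but inside the SWR window, serve stale
//   while refetching in the background
// - miss: too old, the client must wait for the origin
function classifyAge(ageSeconds, sMaxage, swrWindow) {
  if (ageSeconds <= sMaxage) return "fresh";
  if (ageSeconds <= sMaxage + swrWindow) return "stale-revalidate";
  return "miss";
}
```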
4

Implement Tag-Based Cache Invalidation

Avoid relying solely on TTLs. Implement a tagging system where cache entries are grouped by entity (e.g., 'user:123', 'post:456'). When an entity is updated, purge all cache keys associated with that tag across your distributed layers.

invalidation.js
async function updatePost(postId, newData) {
  await db.posts.update(postId, newData);
  // Purge application cache
  await redis.del(`post_data:${postId}`);
  // Purge CDN cache by tag via the Cloudflare API
  // (requires an API token with cache purge permission)
  await fetch(`https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${CF_API_TOKEN}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ tags: [`post-${postId}`] })
  });
}

⚠ Common Pitfalls

  • Creating a 'thundering herd' problem where all edge nodes hit the origin simultaneously after a purge.
  • Inconsistent state if the database update succeeds but the cache purge fails.
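One way to blunt the thundering-herd pitfall above is to coalesce concurrent misses so that only one origin request is in flight per key. A minimal in-process sketch; the `coalescedFetch` name is an assumption, not a library API:

```javascript
// Concurrent cache misses for the same key share a single in-flight
// origin fetch instead of each hitting the origin after a purge.
const inFlight = new Map();

async function coalescedFetch(key, fetchOrigin) {
  if (inFlight.has(key)) return inFlight.get(key); // join the existing fetch
  const promise = fetchOrigin(key).finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}
```

This only dedupes within a single process; across a fleet you would typically take a short-lived distributed lock (e.g. Redis `SET key value NX PX 5000`) so one node refills the cache while the others wait or serve stale.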
5

Instrument Cache Hit Rate Monitoring

Track the effectiveness of your caching strategy. Export metrics to a dashboard (Grafana/Datadog) to visualize 'Hit', 'Miss', and 'Revalidate' events. Aim for a hit rate above 80% for static assets and 30-50% for LLM responses.

telemetry.js
const startTime = Date.now();
const cachedValue = await redis.get(key);
// Record lookup latency (assumes a StatsD-style timing helper)
metrics.timing('cache.lookup_ms', Date.now() - startTime, { layer: 'redis' });

if (cachedValue) {
  metrics.increment('cache.hit', { layer: 'redis' });
  return JSON.parse(cachedValue);
}

metrics.increment('cache.miss', { layer: 'redis' });
const freshData = await fetchFromOrigin();
await redis.setex(key, 3600, JSON.stringify(freshData));

⚠ Common Pitfalls

  • Monitoring only the global hit rate instead of per-route or per-client hit rates.
  • Overlooking 'Cache-Control: no-cache' headers sent by clients, which bypass the cache.
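The per-route pitfall above can be sketched as a small reducer over raw hit/miss events. This is an illustrative shape only; real pipelines would usually aggregate in the metrics backend:

```javascript
// Roll raw cache events up into a per-route hit rate, so a cold route
// cannot hide behind a healthy global average.
function hitRates(events) {
  const byRoute = {};
  for (const { route, type } of events) {
    byRoute[route] ??= { hit: 0, miss: 0 };
    byRoute[route][type] += 1;
  }
  return Object.fromEntries(
    Object.entries(byRoute).map(([route, { hit, miss }]) => [route, hit / (hit + miss)])
  );
}
```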

What you built

Successful caching requires a multi-layered approach that combines deterministic key-value lookups with modern semantic search for AI. By implementing SWR at the edge and robust tag-based invalidation at the backend, you can achieve sub-100ms response times and significantly reduce operational costs.