Semantic Cache
Cache Architecture
The StackFlow semantic cache stores AI responses in Redis and uses vector embeddings for similarity matching. When a new AI query arrives, it is embedded and compared against the embeddings of cached queries. If a sufficiently similar query is found, the cached response is returned immediately without calling Bedrock. This cuts latency (a cache hit takes ~5 ms versus 1-3 s for a Bedrock call) and reduces Bedrock cost.
- ElastiCache Redis: stackflow-redis-prod cluster (cache.t4g.micro) with TLS port 6379 and in-transit encryption enabled
- Secrets Manager: Redis auth token at stackflow/redis/auth-token accessible to StackFlowAPIRole
- Lambda: StackFlowCacheWarmer deployed with EventBridge schedule (daily 02:00 UTC) and Aurora read access
- VPC: Redis and Lambda in same VPC with security group allowing port 6379 inbound from Lambda SG
The Redis cluster is reachable at master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com:6379; connections use TLS and require the authentication token. Each cache entry includes the original query embedding (vector), the original query text, the cached response, metadata (model used, timestamp, hit count), and a TTL.
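To make the entry layout concrete, the fields listed above could be modeled roughly as follows. This is a minimal sketch: the class name, field names, and JSON serialization are assumptions for illustration, not StackFlow's actual storage schema.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class CacheEntry:
    """One semantic-cache record (hypothetical field names)."""

    embedding: list[float]   # original query embedding (vector)
    query_text: str          # original query text
    response: str            # cached AI response
    model: str               # metadata: model used
    created_at: float = field(default_factory=time.time)  # metadata: timestamp
    hit_count: int = 0       # metadata: number of hits served
    ttl_seconds: int = 3600  # expiry applied when the entry is written

    def to_json(self) -> str:
        """Serialize to a JSON string that could be stored as a Redis value."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "CacheEntry":
        """Rebuild an entry from its JSON form."""
        return cls(**json.loads(raw))
```

An entry built this way round-trips through `to_json()` and `from_json()` unchanged, which is the property needed when reading a cached value back.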
Similarity Threshold
The similarity threshold controls when a cached response is considered a valid match. StackFlow uses a default threshold of 0.92 (cosine similarity). Queries with similarity above this threshold receive the cached response. Below the threshold, Bedrock is called and the new response is cached.
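The threshold check described above can be sketched in a few lines. This is an illustration under stated assumptions: the cached entries are held in a plain in-memory list and scanned linearly, and the function names are hypothetical. How StackFlow actually searches the vectors stored in Redis is not specified here.

```python
import math
from typing import Optional, Sequence

SIMILARITY_THRESHOLD = 0.92  # default threshold from the text above


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(query_embedding, cached, threshold=SIMILARITY_THRESHOLD) -> Optional[str]:
    """Return the response of the most similar cached query, or None on a miss.

    `cached` is a list of (embedding, response) pairs. A hit requires similarity
    strictly above the threshold, following the wording "above this threshold".
    """
    best_score, best_response = -1.0, None
    for embedding, response in cached:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score > threshold else None
```

A `None` result corresponds to the miss path: the caller would invoke Bedrock and cache the new response.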
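The overall hit rate can be checked from the Redis INFO statistics:

```python
import os

import redis

# Auth token read from an environment variable (assumed name); the token itself
# is stored in Secrets Manager at stackflow/redis/auth-token.
REDIS_AUTH_TOKEN = os.environ["REDIS_AUTH_TOKEN"]

# Redis client (TLS, authenticated)
r = redis.Redis(
    host='master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com',
    port=6379,
    ssl=True,
    password=REDIS_AUTH_TOKEN,
)

# Check cache hit rate (via INFO stats). These counters cover every key lookup
# on the server, not only semantic-cache reads.
info = r.info('stats')
hits = info['keyspace_hits']
misses = info['keyspace_misses']
total = hits + misses
hit_rate = hits / total * 100 if total else 0.0  # avoid dividing by zero on a fresh server
print(f"Cache hit rate: {hit_rate:.1f}%")
```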
Cache Keys
Cache entries are namespaced by tenant and context to prevent cross-tenant data leakage: t:{tenant_id}:ai:cache:{embedding_hash}. The embedding hash is a 64-bit xxHash of the normalized query text. TTLs are set per query type: AI Copilot responses (1 hour), incident triage (30 minutes), KB article generation (24 hours).
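A sketch of how such a key might be built is shown below. Several details are assumptions: the normalization rule (trim, lowercase, collapse whitespace), the query-type names in the TTL table, and the fallback to a 64-bit BLAKE2b digest when the third-party `xxhash` package is not installed. The fallback only keeps the example runnable; it produces different keys from xxHash and is not a substitute in a real deployment.

```python
import hashlib
import re

try:
    import xxhash  # third-party package; optional for this sketch
except ImportError:
    xxhash = None

# TTLs from the text above; the dictionary keys are hypothetical names.
TTL_SECONDS = {
    "copilot": 3600,          # AI Copilot responses: 1 hour
    "incident_triage": 1800,  # incident triage: 30 minutes
    "kb_generation": 86400,   # KB article generation: 24 hours
}


def normalize(query: str) -> str:
    """Assumed normalization: trim, lowercase, collapse runs of whitespace."""
    return re.sub(r"\s+", " ", query.strip().lower())


def hash64(text: str) -> str:
    """64-bit hash as 16 hex characters (xxHash if available, else BLAKE2b stand-in)."""
    data = text.encode("utf-8")
    if xxhash is not None:
        return xxhash.xxh64(data).hexdigest()
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def cache_key(tenant_id: str, query: str) -> str:
    """Build a tenant-namespaced key of the form t:{tenant_id}:ai:cache:{hash}."""
    return f"t:{tenant_id}:ai:cache:{hash64(normalize(query))}"
```

Because the tenant ID is part of the key, two tenants asking the same question never share an entry, and queries that differ only in case or spacing map to the same key within a tenant.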
Cache Management
Administrators can manage the semantic cache in AI → Settings → Semantic Cache. Available operations include viewing cache statistics, clearing the cache for a specific query prefix, adjusting the similarity threshold, and modifying TTLs by query type. When Redis memory usage approaches the configured limit (default: 80% of available memory), entries are evicted automatically under an LRU policy.
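For monitoring how close the cache is to that limit, a helper along these lines could be applied to the dictionary returned by redis-py's `r.info('memory')`. It is a sketch that assumes the configured limit is measured against Redis's `maxmemory` setting; the function names are hypothetical.

```python
from typing import Mapping, Optional

EVICTION_LIMIT = 0.80  # default limit from the text above


def memory_usage_ratio(memory_info: Mapping[str, int]) -> Optional[float]:
    """Fraction of maxmemory in use, or None when no maxmemory limit is set."""
    max_memory = memory_info.get("maxmemory", 0)
    if not max_memory:  # Redis reports 0 when maxmemory is not configured
        return None
    return memory_info["used_memory"] / max_memory


def near_eviction(memory_info: Mapping[str, int], limit: float = EVICTION_LIMIT) -> bool:
    """True when memory usage has reached the given fraction of maxmemory."""
    ratio = memory_usage_ratio(memory_info)
    return ratio is not None and ratio >= limit
```

With a live client this would be called as `near_eviction(r.info('memory'))`.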
Performance Impact
| Metric | Without Cache | With Cache (40% hit rate) |
|---|---|---|
| P50 AI response latency | ~1.2s | ~0.7s |
| P95 AI response latency | ~3.5s | ~2.1s |
| Daily Bedrock API cost | Baseline | ~40% reduction |
| Bedrock API calls/day | Baseline | ~40% reduction |
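The figures in the table are consistent with a simple weighted blend of hit and miss latencies at a 40% hit rate, using the ~5 ms hit latency quoted earlier. The calculation below is one reading of the numbers, not a stated methodology; percentiles do not strictly combine linearly, so treat it as a rough estimate.

```python
HIT_RATE = 0.40              # hit rate assumed in the table
CACHE_HIT_LATENCY_S = 0.005  # ~5 ms per cache hit


def blended(hit_rate: float, hit_value: float, miss_value: float) -> float:
    """Weighted average of the hit and miss values for a given hit rate."""
    return hit_rate * hit_value + (1.0 - hit_rate) * miss_value


p50 = blended(HIT_RATE, CACHE_HIT_LATENCY_S, 1.2)  # uncached P50 of ~1.2 s
p95 = blended(HIT_RATE, CACHE_HIT_LATENCY_S, 3.5)  # uncached P95 of ~3.5 s
bedrock_call_fraction = 1.0 - HIT_RATE             # only misses reach Bedrock

print(f"P50 ~ {p50:.2f}s, P95 ~ {p95:.2f}s")       # P50 ~ 0.72s, P95 ~ 2.10s
print(f"Bedrock calls: {bedrock_call_fraction:.0%} of baseline")  # 60% of baseline
```

Since only cache misses reach Bedrock, the reduction in calls (and in per-call cost) tracks the hit rate directly.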