Semantic Cache

Cache Architecture

The StackFlow semantic cache stores AI responses in Redis and matches queries by vector-embedding similarity. When a new AI query arrives, it is embedded and compared against the embeddings of cached queries. If a sufficiently similar query is found, the cached response is returned immediately without calling Bedrock. A cache hit takes ~5 ms, compared with 1-3 s for a Bedrock call, so each hit reduces both latency and Bedrock cost.

⚙️ Minimum Requirements

ElastiCache Redis: stackflow-redis-prod cluster (cache.t4g.micro) with TLS port 6379 and in-transit encryption enabled
Secrets Manager: Redis auth token at stackflow/redis/auth-token accessible to StackFlowAPIRole
Lambda: StackFlowCacheWarmer deployed with EventBridge schedule (daily 02:00 UTC) and Aurora read access
VPC: Redis and Lambda in same VPC with security group allowing port 6379 inbound from Lambda SG

The Redis cluster endpoint is master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com:6379. Connections require TLS and the authentication token. Each cache entry contains: the original query embedding (vector), the original query text, the cached response, metadata (model used, timestamp, hit count), and a TTL.
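The entry layout described above can be sketched as a small data class. This is an illustration only: the class name, field names, and JSON serialization are assumptions, and the actual storage format in Redis is not specified here.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CacheEntry:
    """Hypothetical shape of one semantic cache entry (names are assumptions)."""
    embedding: List[float]   # original query embedding
    query_text: str          # original query text
    response: str            # cached AI response
    model: str               # metadata: model used
    ttl_seconds: int         # time to live for this entry
    timestamp: float = field(default_factory=time.time)  # metadata: creation time
    hit_count: int = 0       # metadata: number of times served from cache

    def to_json(self) -> str:
        # Serialize every field so the entry can be stored as a single value.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "CacheEntry":
        # Rebuild the entry from its serialized form.
        return cls(**json.loads(raw))
```

Storing the metadata alongside the response keeps hit counts and model information available for the statistics view without a second lookup.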

Similarity Threshold

The similarity threshold controls when a cached response is considered a valid match. StackFlow uses a default threshold of 0.92 (cosine similarity). A query whose best cached match scores above this threshold receives the cached response. Otherwise, Bedrock is called and the new response is cached.
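The threshold decision can be sketched in a few lines. This is a minimal illustration, not the StackFlow implementation: the in-memory list stands in for Redis, `generate` stands in for the Bedrock call, and both function names are hypothetical. It treats a score exactly equal to the threshold as a miss, which is one reading of "above".

```python
import math

SIMILARITY_THRESHOLD = 0.92  # default threshold stated above


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup_or_generate(query_embedding, cache, generate, threshold=SIMILARITY_THRESHOLD):
    """Return (response, was_hit).

    cache    -- list of (embedding, response) pairs (in-memory stand-in for Redis)
    generate -- zero-argument callable standing in for the Bedrock call
    """
    best_score, best_response = -1.0, None
    for embedding, response in cache:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response

    if best_score > threshold:
        return best_response, True  # cache hit: reuse the stored response

    response = generate()  # cache miss: call the model
    cache.append((query_embedding, response))  # store the new response for next time
    return response, False
```

A linear scan is used here for clarity; a production cache with many entries would normally rely on a vector index rather than comparing against every stored embedding.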

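import boto3
import redis

# Fetch the Redis auth token from Secrets Manager.
# Assumes the secret is stored as a plain string, not a JSON document.
secrets = boto3.client('secretsmanager')
REDIS_AUTH_TOKEN = secrets.get_secret_value(
    SecretId='stackflow/redis/auth-token'
)['SecretString']

# Redis client (TLS, authenticated)
r = redis.Redis(
    host='master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com',
    port=6379,
    ssl=True,
    password=REDIS_AUTH_TOKEN,
)

# Check cache hit rate (via INFO stats).
# Note: these counters track Redis key lookups, not semantic matches.
info = r.info('stats')
hits = info['keyspace_hits']
misses = info['keyspace_misses']
total = hits + misses
# Guard against division by zero on a fresh cluster with no lookups yet.
hit_rate = hits / total * 100 if total else 0.0
print(f"Cache hit rate: {hit_rate:.1f}%")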

Cache Keys

Cache entries are namespaced by tenant and context to prevent cross-tenant data leakage: t:{tenant_id}:ai:cache:{embedding_hash}. The embedding hash is a 64-bit xxHash of the normalized query text. TTLs are set per query type: AI Copilot responses (1 hour), incident triage (30 minutes), KB article generation (24 hours).

Tenant Isolation: The tenant_id namespace prefix ensures that a cached response for Tenant A is never served to Tenant B, even if both tenants ask the exact same question. Do not modify the cache key format without thorough security review.
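A minimal sketch of building a key in the documented format follows. Several parts are assumptions made for illustration: the normalization rule (lowercase, collapsed whitespace), the TTL dictionary key names, and the function names. The hash is a stand-in: the standard-library 64-bit BLAKE2b digest is used so the sketch runs without extra packages, whereas the text specifies xxHash (for example `xxhash.xxh64` from the third-party `xxhash` package).

```python
import hashlib

# TTLs per query type, in seconds (values from the text; key names are hypothetical).
TTL_SECONDS = {
    "copilot": 60 * 60,           # AI Copilot responses: 1 hour
    "incident_triage": 30 * 60,   # incident triage: 30 minutes
    "kb_article": 24 * 60 * 60,   # KB article generation: 24 hours
}


def normalize_query(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def embedding_hash(text: str) -> str:
    # Stand-in 64-bit hash (BLAKE2b, 8-byte digest) instead of xxHash.
    digest = hashlib.blake2b(normalize_query(text).encode("utf-8"), digest_size=8)
    return digest.hexdigest()


def cache_key(tenant_id: str, query_text: str) -> str:
    # Key format from the text: t:{tenant_id}:ai:cache:{embedding_hash}
    return f"t:{tenant_id}:ai:cache:{embedding_hash(query_text)}"
```

Because the tenant ID is part of the key, two tenants asking the same question produce different keys, so neither can read the other's entry.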

Cache Management

Administrators manage the semantic cache under AI → Settings → Semantic Cache. Available operations: view cache statistics, clear the cache for a specific query prefix, adjust the similarity threshold, and modify TTLs by query type. When Redis memory usage approaches the configured limit (default: 80% of available memory), entries are evicted automatically using an LRU (least recently used) policy.

Performance Impact

Metric	Without Cache	With Cache (40% hit rate)
P50 AI response latency	~1.2s	~0.7s
P95 AI response latency	~3.5s	~2.1s
Daily Bedrock API cost	Baseline	~40% reduction
Bedrock API calls/day	Baseline	~40% reduction
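As a rough sanity check on the table, the blended latency at a given hit rate can be estimated as a weighted average of hit and miss latency. This sketch computes a mean, not a percentile, so it only approximates the P50 row. The function name is hypothetical, and the inputs are the ~5 ms hit latency and ~1.2 s uncached latency quoted in this section.

```python
def expected_latency(hit_rate: float, hit_latency_s: float, miss_latency_s: float) -> float:
    """Weighted average latency for a given cache hit rate (0.0 to 1.0)."""
    return hit_rate * hit_latency_s + (1 - hit_rate) * miss_latency_s


# 40% hit rate, ~5 ms per hit, ~1.2 s per Bedrock call
mean_latency = expected_latency(0.40, 0.005, 1.2)
print(f"Estimated mean latency: {mean_latency:.3f}s")  # 0.722s
```

The estimate of about 0.72 s is in line with the ~0.7 s figure in the table. Since every hit avoids one Bedrock call, the reduction in call volume tracks the hit rate directly.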
