Semantic Cache Layer
Cache Architecture
The semantic cache layer sits in front of the full RAG pipeline. Before invoking Neptune, Bedrock, or DynamoDB, the orchestrator computes a deterministic cache key from the normalized query and checks Redis. A cache hit returns the pre-computed answer in under 5ms, eliminating Bedrock API costs and reducing latency by 95% for repeated queries.
- ElastiCache Redis: stackflow-redis-prod (cache.t4g.micro), TLS port 6379, Multi-AZ, engine Redis 7.x
- Secrets Manager: Auth token at stackflow/redis/auth-token, accessible to StackFlowAPIRole and StackFlowCacheWarmerRole
- Lambda: StackFlowCacheWarmer with EventBridge rule stackflow-cache-warmer-daily at 02:00 UTC
- Aurora: ai_outcomes table in stackflow database with query_text, confidence, created_at columns
- VPC: Redis and Lambda in same VPC; security group allows port 6379 from Lambda SG sg-0ada825cda6a75ed6
The StackFlowCacheWarmer Lambda runs daily at 02:00 UTC, querying the Aurora ai_outcomes table for the top-500 most-asked queries in the last 30 days. It pre-computes embeddings and answers for all of them, ensuring the cache is warm for the next business day's load.
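The lookup path described at the start of this section can be sketched as a cache-aside function. This is a minimal illustration, not the orchestrator's actual code: `answer_query`, `run_rag_pipeline`, and the `cache_hit` field are hypothetical names, and `client` is assumed to be any object exposing redis-py style `get` and `setex` methods.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict


def cache_key(query: str) -> str:
    """Build the sf:cache: key from the lowercased, trimmed query."""
    normalized = query.lower().strip()
    return "sf:cache:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def answer_query(client: Any, query: str,
                 run_rag_pipeline: Callable[[str], Dict[str, Any]],
                 ttl_seconds: int = 3600) -> Dict[str, Any]:
    """Return the cached answer if present; otherwise run the pipeline and cache it."""
    key = cache_key(query)
    cached = client.get(key)
    if cached is not None:
        # Cache hit: the RAG pipeline is never invoked.
        result = json.loads(cached)
        result["cache_hit"] = True  # hypothetical field, for illustration only
        return result

    # Cache miss: run the full pipeline, then store the result with a TTL.
    result = run_rag_pipeline(query)
    payload = dict(result, ts=int(time.time()), ttl=ttl_seconds)
    client.setex(key, ttl_seconds, json.dumps(payload))
    return dict(payload, cache_hit=False)
```

Because the key is derived from the normalized text, queries that differ only in case or surrounding whitespace share one entry; paraphrases do not.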
Redis Cluster Details
| Property | Value |
|---|---|
| Cluster ID | stackflow-redis-prod |
| Node Type | cache.t4g.micro |
| Engine | Redis 7.x |
| Primary Endpoint | master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com:6379 |
| Replica Endpoint | replica.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com:6379 |
| TLS | Required (in-transit encryption) |
| Auth Token | Secrets Manager: stackflow/redis/auth-token |
| Multi-AZ | Yes (automatic failover enabled) |
| Max Memory Policy | volatile-lru (evict LRU keys with TTL) |
Cache Key Strategy
Key format: sf:cache:{sha256(lower(trim(query)))}
Value format: JSON { answer, sources, model, ts, ttl, hitCount }
Namespace prefixes:
sf:cache: Standard query cache (TTL: 3600s = 1 hour)
sf:embed: Pre-computed Titan embeddings (TTL: 86400s = 24 hours)
sf:session: AI conversation sessions (TTL: 1800s = 30 minutes)
sf:ratelimit: Per-tenant rate limiting counters (TTL: 60s)
sf:flags: Feature flag cache (TTL: 300s = 5 minutes)
sf:dash: Dashboard widget cache (TTL: 300s)
sf:router: Model router decision cache (TTL: 600s)
Example keys:
sf:cache:a3f5c9d2... → cached answer for "how do I reset cognito password"
sf:embed:a3f5c9d2... → 1024-dim Titan embedding vector (JSON array)
sf:session:sess_abc123 → conversation history for active copilot session
sf:ratelimit:tenant_001 → current invocation count for rate limiting
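As a rough sketch of the key and value formats above: the helper names (`hashed_key`, `cache_value`) are hypothetical, and storing `ts` as epoch seconds is an assumption, since the document does not specify the timestamp format.

```python
import hashlib
import json
import time
from typing import List, Optional

# TTLs in seconds, copied from the namespace list above.
NAMESPACE_TTLS = {
    "sf:cache": 3600,
    "sf:embed": 86400,
    "sf:session": 1800,
    "sf:ratelimit": 60,
    "sf:flags": 300,
    "sf:dash": 300,
    "sf:router": 600,
}


def hashed_key(namespace: str, query: str) -> str:
    """Return {namespace}:{sha256(lower(trim(query)))} for a known namespace."""
    if namespace not in NAMESPACE_TTLS:
        raise ValueError(f"unknown namespace: {namespace}")
    digest = hashlib.sha256(query.lower().strip().encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


def cache_value(answer: str, sources: List[str], model: str,
                now: Optional[int] = None) -> str:
    """Serialize the JSON value stored under an sf:cache: key."""
    return json.dumps({
        "answer": answer,
        "sources": sources,
        "model": model,
        "ts": int(time.time()) if now is None else now,  # assumed: epoch seconds
        "ttl": NAMESPACE_TTLS["sf:cache"],
        "hitCount": 0,
    })
```

The same digest is reused for the `sf:cache:` and `sf:embed:` keys of a query, which is why both can be deleted together during invalidation.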
Cache Warmer Lambda
import boto3
import redis
import json
import hashlib
import os
import psycopg2
from typing import List, Tuple


def get_secret(secret_id: str) -> dict:
    """Fetch a JSON secret from Secrets Manager and parse it."""
    sm = boto3.client('secretsmanager', region_name='us-east-1')
    return json.loads(sm.get_secret_value(SecretId=secret_id)['SecretString'])


def lambda_handler(event, context):
    """
    StackFlowCacheWarmer -- Daily pre-computation of top-500 AI queries.
    Triggered by EventBridge rule stackflow-cache-warmer-daily at 02:00 UTC.
    """
    # Honor an optional "limit" in the payload (manual runs), capped at 500
    limit = max(1, min(int((event or {}).get('limit', 500)), 500))

    # Connect to Aurora PostgreSQL
    db_creds = get_secret('stackflow/aurora-db-credentials')
    conn = psycopg2.connect(
        host=os.environ['PG_HOST'],
        database=os.environ.get('PG_DATABASE', 'stackflow'),
        user=db_creds['username'],
        password=db_creds['password'],
        port=5432,
        connect_timeout=10
    )

    # Connect to Redis with TLS and auth token
    redis_creds = get_secret('stackflow/redis/auth-token')
    r = redis.Redis(
        host='master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com',
        port=6379,
        password=redis_creds['auth_token'],
        ssl=True,
        # NOTE: 'none' disables server certificate verification;
        # 'required' is the safer setting if the CA bundle is available.
        ssl_cert_reqs='none',
        decode_responses=True,
        socket_connect_timeout=5,
        socket_timeout=5
    )

    # Fetch the most frequent queries (last 30 days)
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT query_text, AVG(confidence) AS avg_conf, COUNT(*) AS freq
                FROM ai_outcomes
                WHERE created_at > NOW() - INTERVAL '30 days'
                  AND query_text IS NOT NULL
                  AND LENGTH(query_text) > 5
                GROUP BY query_text
                ORDER BY freq DESC, avg_conf DESC
                LIMIT %s
            """, (limit,))
            top_queries: List[Tuple] = cur.fetchall()
    finally:
        # Release the database connection even if the query fails
        conn.close()

    bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
    warmed = 0
    skipped = 0
    errors = 0

    for query_text, avg_conf, freq in top_queries:
        # Same normalization as the cache key strategy: lower(trim(query))
        normalized = query_text.lower().strip()
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        cache_key = f"sf:cache:{digest}"
        embed_key = f"sf:embed:{digest}"

        # Skip if already cached with fresh TTL (> 30 min remaining).
        # TTL returns -2 for a missing key and -1 for a key with no expiry.
        ttl = r.ttl(cache_key)
        if ttl > 1800:
            skipped += 1
            continue

        try:
            # Generate embedding via Titan Embed v2
            embed_response = bedrock.invoke_model(
                modelId='amazon.titan-embed-text-v2:0',
                body=json.dumps({'inputText': normalized, 'dimensions': 1024}),
                contentType='application/json',
                accept='application/json'
            )
            embedding = json.loads(embed_response['body'].read())['embedding']

            # Store embedding (24-hour TTL).
            # NOTE: this listing writes only embed_key; the step that
            # generates the answer and writes cache_key is not shown here.
            r.setex(embed_key, 86400, json.dumps(embedding))
            warmed += 1
        except Exception as e:
            errors += 1
            print(f"ERROR warming query '{query_text[:50]}': {e}")
            continue

    return {
        'statusCode': 200,
        'warmed': warmed,
        'skipped': skipped,
        'errors': errors,
        'total_queries': len(top_queries)
    }
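The skip check in the warmer depends on how Redis reports TTLs: the TTL command returns -2 when the key does not exist and -1 when the key exists without an expiry. The sketch below spells out the resulting decision for each case. `warm_decision` is a hypothetical helper, not part of the Lambda; it mirrors the `ttl > 1800` comparison used above.

```python
KEY_MISSING = -2  # Redis TTL reply when the key does not exist
NO_EXPIRY = -1    # Redis TTL reply when the key exists but has no TTL


def warm_decision(ttl_seconds: int, min_remaining: int = 1800) -> str:
    """Classify a TTL reply the same way the warmer's `ttl > 1800` check does."""
    if ttl_seconds == KEY_MISSING:
        return "warm: key missing"
    if ttl_seconds == NO_EXPIRY:
        # The plain comparison also re-warms keys that never expire.
        return "warm: key has no expiry"
    if ttl_seconds > min_remaining:
        return "skip: fresh"
    return "warm: expiring soon"
```

With the 1-hour `sf:cache:` TTL, an entry written less than 30 minutes before the 02:00 UTC run is skipped, and anything older is refreshed.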
Monitoring Cache Hit Rate
Cache performance is tracked via CloudWatch custom metrics in the StackFlow/AI namespace. The StackFlowAPI Lambda publishes CacheHit and CacheMiss counters on every AI request.
# Get cache hit rate for the last hour
aws cloudwatch get-metric-statistics --namespace StackFlow/AI --metric-name CacheHitRate --dimensions Name=TenantId,Value=tenant_001 --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 300 --statistics Average --region us-east-1
# Get raw Redis hit/miss stats (from within VPC)
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com -p 6379 --tls -a $AUTH_TOKEN INFO stats | grep keyspace
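The raw Redis counters can be turned into a hit rate with a few lines of Python. This is a sketch under assumptions: `parse_info_stats` and `hit_rate` are hypothetical helpers, and the input is the text printed by `INFO stats`. Note that `keyspace_hits` and `keyspace_misses` are server-wide and cumulative, so they cover every `sf:` namespace, not only `sf:cache:`. The same ratio applies to sums of the CacheHit and CacheMiss counters.

```python
from typing import Dict


def parse_info_stats(text: str) -> Dict[str, str]:
    """Parse `name:value` lines from Redis INFO output, ignoring section headers."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        stats[name] = value
    return stats


def hit_rate(hits: int, misses: int) -> float:
    """Return hits / (hits + misses), or 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


def hit_rate_from_info(text: str) -> float:
    """Compute the server-wide hit rate from INFO stats output."""
    stats = parse_info_stats(text)
    return hit_rate(int(stats.get("keyspace_hits", 0)),
                    int(stats.get("keyspace_misses", 0)))
```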
Manually Warming the Cache
# Invoke the CacheWarmer Lambda manually (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke --function-name StackFlowCacheWarmer --cli-binary-format raw-in-base64-out --payload '{"mode": "manual", "limit": 100}' --region us-east-1 /tmp/warmer-result.json && cat /tmp/warmer-result.json
# Expected output:
# {"statusCode": 200, "warmed": 87, "skipped": 13, "errors": 0, "total_queries": 100}
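The same manual invocation can be scripted. The sketch below assumes a boto3 Lambda client is passed in (for example `boto3.client('lambda', region_name='us-east-1')`); `invoke_warmer` is a hypothetical helper name, and the payload fields are the ones shown in the CLI example.

```python
import json
from typing import Any, Dict


def invoke_warmer(lambda_client: Any, limit: int = 100,
                  function_name: str = "StackFlowCacheWarmer") -> Dict[str, Any]:
    """Invoke the warmer synchronously and return its parsed JSON result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps({"mode": "manual", "limit": limit}).encode("utf-8"),
    )
    body = response["Payload"].read()
    if response.get("FunctionError"):
        # The function raised; the payload then carries the error details.
        raise RuntimeError(f"warmer failed: {body!r}")
    return json.loads(body)


# Usage (requires AWS credentials and network access):
#   import boto3
#   result = invoke_warmer(boto3.client("lambda", region_name="us-east-1"))
#   print(result["warmed"], result["skipped"], result["errors"])
```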
Cache Invalidation
Cache keys are invalidated automatically when their TTL expires. Manual invalidation is needed when a runbook is updated or a known-incorrect cached answer must be removed.
# Invalidate a specific query's cache entry
# First compute the SHA256 key
python3 -c "import hashlib; q='how do i reset cognito password'; print(hashlib.sha256(q.lower().strip().encode()).hexdigest())"
# Then delete from Redis (from within VPC)
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com -p 6379 --tls -a $AUTH_TOKEN DEL sf:cache:<sha256_hash> sf:embed:<sha256_hash>
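# Delete all standard query cache entries (use with caution: sf:cache: keys are not tenant-scoped, so this affects every tenant)
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com -p 6379 --tls -a $AUTH_TOKEN --scan --pattern 'sf:cache:*' | xargs -n 100 redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com -p 6379 --tls -a $AUTH_TOKEN DEL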
Do not use FLUSHDB for this: it deletes every key in the database, including sessions, rate limit counters, and dashboard caches, not just AI query caches. Always target specific key patterns using --scan with --pattern.
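Pattern-scoped deletion can also be done from Python. The sketch below is one possible approach, assuming a redis-py client (its `scan_iter` and `delete` methods); `delete_by_pattern` is a hypothetical helper, and the refusal of a bare `*` pattern is an added safety check, not an existing StackFlow rule.

```python
from typing import Any, List


def delete_by_pattern(client: Any, pattern: str, batch_size: int = 500) -> int:
    """Delete keys matching `pattern` in batches; return the number of keys removed."""
    if pattern.strip() in ("", "*"):
        # Guard against wiping sessions, rate limit counters, and dashboard caches.
        raise ValueError("refusing to delete with an unscoped pattern")

    deleted = 0
    batch: List[str] = []
    # SCAN walks the keyspace incrementally instead of blocking like KEYS.
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch = []
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

Calling `delete_by_pattern(r, "sf:cache:*")` removes only standard query cache entries and leaves the other `sf:` namespaces untouched.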