Semantic Cache Layer

Cache Architecture

The semantic cache layer sits in front of the full RAG pipeline. Before invoking Neptune, Bedrock, or DynamoDB, the orchestrator computes a deterministic cache key from the normalized query and checks Redis. Because the key is a hash of the normalized text, two queries share an entry only if they are identical after trimming and lowercasing. A cache hit returns the pre-computed answer in under 5 ms, avoiding the Bedrock API call and reducing latency by 95% for repeated queries.
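The lookup described above follows the cache-aside pattern. The sketch below is illustrative only: `answer_query` and `run_pipeline` are hypothetical names, and `client` is assumed to be any object with redis-py style `get` and `setex` methods.

```python
import hashlib
import json
from typing import Callable


def cache_key(query: str) -> str:
    """Build the sf:cache key from the lowercased, trimmed query."""
    normalized = query.lower().strip()
    return f"sf:cache:{hashlib.sha256(normalized.encode('utf-8')).hexdigest()}"


def answer_query(client, query: str, run_pipeline: Callable[[str], dict],
                 ttl_seconds: int = 3600) -> dict:
    """Return the cached answer if present; otherwise run the pipeline and cache it."""
    key = cache_key(query)
    cached = client.get(key)
    if cached is not None:
        # Cache hit: no Neptune, Bedrock, or DynamoDB call is made.
        return json.loads(cached)
    # Cache miss: fall through to the full RAG pipeline (hypothetical callable).
    result = run_pipeline(query)
    client.setex(key, ttl_seconds, json.dumps(result))
    return result
```

The default TTL of 3600 seconds matches the `sf:cache:` namespace; everything else about the pipeline call is an assumption.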

⚙️ Minimum Requirements

ElastiCache Redis: stackflow-redis-prod (cache.t4g.micro), TLS port 6379, Multi-AZ, engine Redis 7.x
Secrets Manager: Auth token at stackflow/redis/auth-token accessible to StackFlowAPIRole and StackFlowCacheWarmerRole
Lambda: StackFlowCacheWarmer with EventBridge rule stackflow-cache-warmer-daily at 02:00 UTC
Aurora: ai_outcomes table in stackflow database with query_text, confidence, created_at columns
VPC: Redis and Lambda in same VPC; security group allows port 6379 from Lambda SG sg-0ada825cda6a75ed6

The StackFlowCacheWarmer Lambda runs daily at 02:00 UTC, querying the Aurora ai_outcomes table for the top-500 most-asked queries in the last 30 days. It pre-computes embeddings and answers for all of them, ensuring the cache is warm for the next business day's load.

Cache ROI: In production deployments, semantic caching typically achieves 40–70% hit rates on repeated ITSM queries, reducing Bedrock API costs proportionally. The top-10 queries alone often account for 30% of all AI traffic.
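As a rough back-of-the-envelope illustration of the proportional-savings claim, the sketch below multiplies request volume by hit rate and per-call cost. The request volume and per-call cost are hypothetical placeholders, not figures from this deployment.

```python
def estimated_savings(requests: int, hit_rate: float, cost_per_call: float) -> float:
    """Bedrock spend avoided, assuming every cache hit skips one paid model call."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return requests * hit_rate * cost_per_call


# Hypothetical inputs: 100,000 requests per month at $0.01 per Bedrock call,
# evaluated at both ends of the 40-70% hit-rate range quoted above.
for rate in (0.40, 0.70):
    saved = estimated_savings(100_000, rate, 0.01)
    print(f"hit rate {rate:.0%}: ${saved:,.2f} saved per month")
```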

Redis Cluster Details

Property	Value
Cluster ID	`stackflow-redis-prod`
Node Type	`cache.t4g.micro`
Engine	Redis 7.x
Primary Endpoint	`master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com:6379`
Replica Endpoint	`replica.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com:6379`
TLS	Required (in-transit encryption)
Auth Token	Secrets Manager: `stackflow/redis/auth-token`
Multi-AZ	Yes (automatic failover enabled)
Max Memory Policy	`volatile-lru` (evict LRU keys with TTL)

Cache Key Strategy

Key format:   sf:cache:{sha256(lower(trim(query)))}
Value format: JSON { answer, sources, model, ts, ttl, hitCount }

Namespace prefixes:
  sf:cache:         Standard query cache (TTL: 3600s = 1 hour)
  sf:embed:         Pre-computed Titan embeddings (TTL: 86400s = 24 hours)
  sf:session:       AI conversation sessions (TTL: 1800s = 30 minutes)
  sf:ratelimit:     Per-tenant rate limiting counters (TTL: 60s)
  sf:flags:         Feature flag cache (TTL: 300s = 5 minutes)
  sf:dash:          Dashboard widget cache (TTL: 300s)
  sf:router:        Model router decision cache (TTL: 600s)

Example keys:
  sf:cache:a3f5c9d2...   → cached answer for "how do I reset cognito password"
  sf:embed:a3f5c9d2...   → 1024-dim Titan embedding vector (JSON array)
  sf:session:sess_abc123  → conversation history for active copilot session
  sf:ratelimit:tenant_001 → current invocation count for rate limiting
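The key and value formats above could be assembled along these lines. This is a sketch, not the service's actual code: the helper names are hypothetical, `ts` is assumed to be Unix epoch seconds, and `hitCount` is assumed to start at zero.

```python
import hashlib
import json
import time
from typing import List, Optional

# TTLs in seconds, copied from the namespace list above.
NAMESPACE_TTLS = {
    "cache": 3600, "embed": 86400, "session": 1800, "ratelimit": 60,
    "flags": 300, "dash": 300, "router": 600,
}


def hashed_key(namespace: str, query: str) -> str:
    """sf:<namespace>:<sha256 of the lowercased, trimmed query>."""
    if namespace not in NAMESPACE_TTLS:
        raise ValueError(f"unknown namespace: {namespace}")
    digest = hashlib.sha256(query.lower().strip().encode("utf-8")).hexdigest()
    return f"sf:{namespace}:{digest}"


def build_cache_value(answer: str, sources: List[str], model: str,
                      now: Optional[int] = None) -> str:
    """Serialize the JSON envelope stored under an sf:cache: key."""
    return json.dumps({
        "answer": answer,
        "sources": sources,
        "model": model,
        "ts": int(time.time()) if now is None else now,  # assumed: epoch seconds
        "ttl": NAMESPACE_TTLS["cache"],
        "hitCount": 0,                                   # assumed initial value
    })
```

Because the eviction policy is `volatile-lru`, only keys written with a TTL are eligible for eviction, so every write should go through `SETEX` (or `SET` with `EX`) using the TTL for its namespace.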

Cache Warmer Lambda

import boto3
import redis
import json
import hashlib
import os
import psycopg2
from typing import List, Tuple

def get_secret(secret_id: str) -> dict:
    sm = boto3.client('secretsmanager', region_name='us-east-1')
    return json.loads(sm.get_secret_value(SecretId=secret_id)['SecretString'])

def lambda_handler(event, context):
    """
    StackFlowCacheWarmer -- Daily pre-computation of top-500 AI queries.
    Triggered by EventBridge rule stackflow-cache-warmer-daily at 02:00 UTC.
    """
    # Connect to Aurora PostgreSQL
    db_creds = get_secret('stackflow/aurora-db-credentials')
    conn = psycopg2.connect(
        host=os.environ['PG_HOST'],
        database=os.environ.get('PG_DATABASE', 'stackflow'),
        user=db_creds['username'],
        password=db_creds['password'],
        port=5432,
        connect_timeout=10
    )

    # Connect to Redis with TLS and auth token
    redis_creds = get_secret('stackflow/redis/auth-token')
    r = redis.Redis(
        host='master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com',
        port=6379,
        password=redis_creds['auth_token'],
        ssl=True,
        ssl_cert_reqs='required',  # verify the server certificate ('none' disables verification)
        decode_responses=True,
        socket_connect_timeout=5,
        socket_timeout=5
    )

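    # Fetch the top-N queries by frequency (last 30 days).
    # N defaults to 500; a manual invocation may pass {"limit": <n>} to lower it.
    limit = max(1, min(int((event or {}).get('limit', 500)), 500))
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT query_text, AVG(confidence) as avg_conf, COUNT(*) as freq
                FROM ai_outcomes
                WHERE created_at > NOW() - INTERVAL '30 days'
                  AND query_text IS NOT NULL
                  AND LENGTH(query_text) > 5
                GROUP BY query_text
                ORDER BY freq DESC, avg_conf DESC
                LIMIT %s
            """, (limit,))
            top_queries: List[Tuple] = cur.fetchall()
    finally:
        # Always release the Aurora connection, even if the query fails
        conn.close()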

    bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
    warmed = 0
    skipped = 0
    errors = 0

    for query_text, avg_conf, freq in top_queries:
        normalized = query_text.lower().strip()
        digest = hashlib.sha256(normalized.encode('utf-8')).hexdigest()
        cache_key = f"sf:cache:{digest}"
        embed_key = f"sf:embed:{digest}"

        # Skip if the answer is already cached with more than 30 min remaining.
        # r.ttl() returns -2 for a missing key and -1 for a key with no expiry,
        # so both of those cases fall through and are warmed.
        ttl = r.ttl(cache_key)
        if ttl > 1800:
            skipped += 1
            continue

        try:
            # Generate embedding via Titan Embed v2
            embed_response = bedrock.invoke_model(
                modelId='amazon.titan-embed-text-v2:0',
                body=json.dumps({'inputText': normalized, 'dimensions': 1024}),
                contentType='application/json',
                accept='application/json'
            )
            embedding = json.loads(embed_response['body'].read())['embedding']

            # Store embedding (24-hour TTL). This listing stores only the
            # embedding; the step that writes the sf:cache answer is not shown.
            r.setex(embed_key, 86400, json.dumps(embedding))
            warmed += 1
        except Exception as e:
            errors += 1
            print(f"ERROR warming query '{query_text[:50]}': {e}")
            continue

    return {
        'statusCode': 200,
        'warmed': warmed,
        'skipped': skipped,
        'errors': errors,
        'total_queries': len(top_queries)
    }

Monitoring Cache Hit Rate

Cache performance is tracked via CloudWatch custom metrics in the StackFlow/AI namespace. The StackFlowAPI Lambda publishes CacheHit and CacheMiss counters on every AI request.
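One way the per-request counters might be shaped for CloudWatch is sketched below. The `TenantId` dimension is taken from the CLI example in this section; the helper names are hypothetical, and the actual StackFlowAPI code may differ.

```python
from typing import Dict, List


def build_cache_metric(tenant_id: str, hit: bool) -> List[Dict]:
    """Build a MetricData payload with one CacheHit or CacheMiss data point."""
    return [{
        "MetricName": "CacheHit" if hit else "CacheMiss",
        "Dimensions": [{"Name": "TenantId", "Value": tenant_id}],
        "Unit": "Count",
        "Value": 1.0,
    }]


def hit_rate(hits: int, misses: int) -> float:
    """Fraction of requests served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0
```

The payload would be sent with `boto3.client("cloudwatch").put_metric_data(Namespace="StackFlow/AI", MetricData=build_cache_metric("tenant_001", hit=True))`.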

# Get cache hit rate for the last hour (date -d is GNU date syntax)
aws cloudwatch get-metric-statistics \
  --namespace StackFlow/AI \
  --metric-name CacheHitRate \
  --dimensions Name=TenantId,Value=tenant_001 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Average \
  --region us-east-1

# Get raw Redis hit/miss stats (from within VPC)
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
  -p 6379 --tls -a $AUTH_TOKEN INFO stats | grep keyspace
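The `INFO stats` output contains `keyspace_hits` and `keyspace_misses` counters, from which a hit ratio can be derived. A minimal parsing sketch follows; the function name is hypothetical. Note that these counters are server-wide, so they cover every namespace on the node (sessions, rate limits, flags), not only `sf:cache:` lookups.

```python
def parse_hit_rate(info_text: str) -> float:
    """Compute hits / (hits + misses) from the text of `INFO stats`."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        stats[name] = value
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0
```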

Manually Warming the Cache

# Invoke the CacheWarmer Lambda manually
# (AWS CLI v2 also needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload)
aws lambda invoke \
  --function-name StackFlowCacheWarmer \
  --payload '{"mode": "manual", "limit": 100}' \
  --region us-east-1 \
  /tmp/warmer-result.json && cat /tmp/warmer-result.json

# Expected output:
# {"statusCode": 200, "warmed": 87, "skipped": 13, "errors": 0, "total_queries": 100}

Cache Invalidation

Cache keys are invalidated automatically when their TTL expires. Manual invalidation is needed when a runbook is updated or a known-incorrect cached answer must be removed.

# Invalidate a specific query's cache entry
# First compute the SHA256 key
python3 -c "import hashlib; q='how do i reset cognito password'; print(hashlib.sha256(q.lower().strip().encode()).hexdigest())"
# Then delete from Redis (from within VPC)
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
  -p 6379 --tls -a $AUTH_TOKEN \
  DEL sf:cache:<sha256_hash> sf:embed:<sha256_hash>

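# Delete all standard query-cache entries (use with caution)
# Note: sf:cache: keys carry no tenant ID, so this pattern affects every tenant.
# The second redis-cli needs the same connection flags, or it targets localhost.
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
  -p 6379 --tls -a $AUTH_TOKEN \
  --scan --pattern 'sf:cache:*' \
  | xargs redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
      -p 6379 --tls -a $AUTH_TOKEN DEL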

Warning: FLUSHDB deletes every key in the database, including sessions, rate limit counters, and dashboard caches, not just AI query caches. Always target specific key patterns using SCAN with --pattern.
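The same pattern-scoped deletion can be done from Python. The sketch below assumes a redis-py style client exposing `scan_iter` and `delete`; the function name and the guard against overly broad patterns are illustrative choices, not part of the documented tooling.

```python
def delete_by_pattern(client, pattern: str, batch_size: int = 500) -> int:
    """Delete keys matching a namespaced pattern in batches; return the count removed."""
    # Refuse patterns that would sweep up sessions, rate limits, and other namespaces.
    if not pattern.startswith("sf:") or pattern == "sf:*":
        raise ValueError("pattern must target a single sf:<namespace>: prefix")
    deleted = 0
    batch = []
    # SCAN iterates incrementally, so it does not block Redis the way KEYS would.
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch.clear()
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

For example, `delete_by_pattern(r, "sf:cache:*")` would clear the query cache while leaving `sf:session:` and `sf:ratelimit:` keys untouched.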
