Database Issues
Aurora Diagnostics
Aurora PostgreSQL 16 (main cluster: stackflow-main-prod.cluster-c6pq0smgmlri.us-east-1.rds.amazonaws.com) is the primary data store. Most StackFlow functionality depends on Aurora availability. Check cluster status and recent events first when investigating database issues.
⚙️ Minimum Requirements
- Aurora: Performance Insights enabled on
stackflow-main-prodcluster (7-day free retention) - CloudWatch: Aurora metrics:
DatabaseConnections,FreeableMemory,ReadLatency,WriteLatencyalarm-monitored - Neptune: CloudWatch namespace
AWS/NeptunewithCPUUtilization,GremlinRequestsPerSecmonitored - Redis: CloudWatch namespace
AWS/ElastiCachewithCacheHits,CacheMisses,CurrConnectionstracked
| Symptom | Likely Cause | Diagnostic Step | Resolution |
|---|---|---|---|
| All API requests failing with DB error | Aurora cluster failover in progress | Check Aurora Events in RDS console | Wait for failover (1-2 min), then verify DNS resolution for cluster endpoint |
| Slow queries during peak hours | Missing index or lock contention | Query pg_stat_activity and pg_locks | Add missing index, identify and kill blocking queries |
| Disk space alarm | Table bloat from high-churn data | Run SELECT pg_size_pretty(pg_database_size('stackflow')) | Schedule VACUUM ANALYZE, consider Aurora Serverless auto-scaling |
| Replication lag alerts | Heavy write load overwhelming reader replication | Check ReplicaLag CloudWatch metric | Route read traffic to writer temporarily, review write patterns |
Connection Pool Issues
-- Check active connections by state
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'stackflow'
GROUP BY state ORDER BY count DESC;
-- Find long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '1 minute'
AND datname = 'stackflow'
ORDER BY duration DESC;
-- Kill a blocking query
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = [blocking_pid];
Max Connections: Aurora db.r6g.xlarge supports ~2500 connections. The StackFlowAPI Lambda uses a pool of 10 connections per instance. With Lambda concurrency of 200, that's up to 2000 connections — close to the limit. Monitor the DatabaseConnections CloudWatch metric and set an alarm at 80% of max_connections.
Neptune Issues
| Symptom | Likely Cause | Diagnostic Step | Resolution |
|---|---|---|---|
| Impact analysis timeout | Graph traversal too deep / no depth limit | Check Gremlin query for missing .times() limit | Add max_depth parameter to impact analysis call |
| Neptune connection refused | Neptune maintenance window active | Check Neptune cluster status in AWS console | Wait for maintenance to complete (~5-15 min) |
| Stale CI data in graph | CMDB seeder Lambda not running | Check StackFlowNeptuneCMDBSeeder Lambda recent invocations | Manually invoke seeder or check EventBridge schedule |
Redis Issues
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com -p 6379 -a "$REDIS_AUTH_TOKEN" --tls INFO memory | grep used_memory_human
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com -p 6379 -a "$REDIS_AUTH_TOKEN" --tls INFO stats | grep evicted_keys
Slow Query Analysis
Enable Aurora slow query logging via the parameter group: set log_min_duration_statement = 1000 (log queries taking >1 second). Logs are published to CloudWatch under /aws/rds/cluster/stackflow-main-prod/postgresql. Use the pg_stat_statements extension for aggregate slow query analysis across all Lambda instances.
Database Diagnostic Queries
-- Aurora: Check active connection count by state
SELECT state, count(*) as connection_count,
max(EXTRACT(EPOCH FROM (now() - query_start))::int) as max_query_age_sec
FROM pg_stat_activity
WHERE datname = 'stackflow'
GROUP BY state
ORDER BY connection_count DESC;
-- Aurora: Find long-running queries (> 30 seconds)
SELECT pid, now() - pg_stat_activity.query_start AS duration,
state, left(query, 200) as query_preview
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
AND state != 'idle'
AND datname = 'stackflow'
ORDER BY duration DESC;
-- Aurora: Top 10 slow queries (requires pg_stat_statements extension)
SELECT round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS avg_ms,
calls,
left(query, 150) AS query_preview
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = 'stackflow')
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Aurora: Table bloat check (look for tables needing VACUUM)
SELECT relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
n_dead_tup AS dead_rows,
n_live_tup AS live_rows,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_pct DESC
LIMIT 20;
// Neptune: Gremlin diagnostic queries
// Check total graph size
g.V().count() // vertex count
g.E().count() // edge count
// List all vertex labels and counts
g.V().groupCount().by(label())
// Find vertices with no edges (orphan CIs -- may indicate seeder issues)
g.V().where(__.not(bothE())).limit(20).valueMap('ciId', 'name', 'type')
// Check specific CI exists
g.V().has('CI', 'name', 'aurora-main-prod').valueMap()
// Verify DEPENDS_ON edges are populated
g.E().hasLabel('DEPENDS_ON').limit(5).project('from','to').by(outV().values('name')).by(inV().values('name'))
# Redis: Check memory usage and key count by prefix
# (Run from within VPC using redis-cli)
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
-p 6379 --tls -a $REDIS_AUTH_TOKEN INFO memory | grep used_memory_human
# Count keys by prefix
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
-p 6379 --tls -a $REDIS_AUTH_TOKEN \
--scan --pattern 'sf:cache:*' | wc -l
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
-p 6379 --tls -a $REDIS_AUTH_TOKEN \
--scan --pattern 'sf:session:*' | wc -l
# Check hit/miss ratio
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
-p 6379 --tls -a $REDIS_AUTH_TOKEN INFO stats | grep -E 'keyspace_hits|keyspace_misses'