v2026.1 Open Portal ↗
On this page

Database Issues

Aurora Diagnostics

Aurora PostgreSQL 16 (main cluster: stackflow-main-prod.cluster-c6pq0smgmlri.us-east-1.rds.amazonaws.com) is the primary data store. Most StackFlow functionality depends on Aurora availability. Check cluster status and recent events first when investigating database issues.

⚙️ Minimum Requirements
  • Aurora: Performance Insights enabled on stackflow-main-prod cluster (7-day free retention)
  • CloudWatch: Aurora metrics: DatabaseConnections, FreeableMemory, ReadLatency, WriteLatency alarm-monitored
  • Neptune: CloudWatch namespace AWS/Neptune with CPUUtilization, GremlinRequestsPerSec monitored
  • Redis: CloudWatch namespace AWS/ElastiCache with CacheHits, CacheMisses, CurrConnections tracked
SymptomLikely CauseDiagnostic StepResolution
All API requests failing with DB errorAurora cluster failover in progressCheck Aurora Events in RDS consoleWait for failover (1-2 min), then verify DNS resolution for cluster endpoint
Slow queries during peak hoursMissing index or lock contentionQuery pg_stat_activity and pg_locksAdd missing index, identify and kill blocking queries
Disk space alarmTable bloat from high-churn dataRun SELECT pg_size_pretty(pg_database_size('stackflow'))Schedule VACUUM ANALYZE, consider Aurora Serverless auto-scaling
Replication lag alertsHeavy write load overwhelming reader replicationCheck ReplicaLag CloudWatch metricRoute read traffic to writer temporarily, review write patterns

Connection Pool Issues

-- Check active connections by state
SELECT state, count(*) 
FROM pg_stat_activity 
WHERE datname = 'stackflow'
GROUP BY state ORDER BY count DESC;

-- Find long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '1 minute'
  AND datname = 'stackflow'
ORDER BY duration DESC;

-- Kill a blocking query
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = [blocking_pid];
Max Connections: Aurora db.r6g.xlarge supports ~2500 connections. The StackFlowAPI Lambda uses a pool of 10 connections per instance. With Lambda concurrency of 200, that's up to 2000 connections — close to the limit. Monitor the DatabaseConnections CloudWatch metric and set an alarm at 80% of max_connections.

Neptune Issues

SymptomLikely CauseDiagnostic StepResolution
Impact analysis timeoutGraph traversal too deep / no depth limitCheck Gremlin query for missing .times() limitAdd max_depth parameter to impact analysis call
Neptune connection refusedNeptune maintenance window activeCheck Neptune cluster status in AWS consoleWait for maintenance to complete (~5-15 min)
Stale CI data in graphCMDB seeder Lambda not runningCheck StackFlowNeptuneCMDBSeeder Lambda recent invocationsManually invoke seeder or check EventBridge schedule

Redis Issues

redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com   -p 6379 -a "$REDIS_AUTH_TOKEN" --tls   INFO memory | grep used_memory_human

redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com   -p 6379 -a "$REDIS_AUTH_TOKEN" --tls   INFO stats | grep evicted_keys

Slow Query Analysis

Enable Aurora slow query logging via the parameter group: set log_min_duration_statement = 1000 (log queries taking >1 second). Logs are published to CloudWatch under /aws/rds/cluster/stackflow-main-prod/postgresql. Use the pg_stat_statements extension for aggregate slow query analysis across all Lambda instances.

Database Diagnostic Queries

-- Aurora: Check active connection count by state
SELECT state, count(*) as connection_count, 
       max(EXTRACT(EPOCH FROM (now() - query_start))::int) as max_query_age_sec
FROM pg_stat_activity
WHERE datname = 'stackflow'
GROUP BY state
ORDER BY connection_count DESC;

-- Aurora: Find long-running queries (> 30 seconds)
SELECT pid, now() - pg_stat_activity.query_start AS duration,
       state, left(query, 200) as query_preview
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
  AND state != 'idle'
  AND datname = 'stackflow'
ORDER BY duration DESC;

-- Aurora: Top 10 slow queries (requires pg_stat_statements extension)
SELECT round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2) AS avg_ms,
       calls,
       left(query, 150) AS query_preview
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = 'stackflow')
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Aurora: Table bloat check (look for tables needing VACUUM)
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_dead_tup AS dead_rows,
       n_live_tup AS live_rows,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_pct DESC
LIMIT 20;
// Neptune: Gremlin diagnostic queries

// Check total graph size
g.V().count()  // vertex count
g.E().count()  // edge count

// List all vertex labels and counts
g.V().groupCount().by(label())

// Find vertices with no edges (orphan CIs -- may indicate seeder issues)
g.V().where(__.not(bothE())).limit(20).valueMap('ciId', 'name', 'type')

// Check specific CI exists
g.V().has('CI', 'name', 'aurora-main-prod').valueMap()

// Verify DEPENDS_ON edges are populated
g.E().hasLabel('DEPENDS_ON').limit(5).project('from','to').by(outV().values('name')).by(inV().values('name'))
# Redis: Check memory usage and key count by prefix
# (Run from within VPC using redis-cli)
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
  -p 6379 --tls -a $REDIS_AUTH_TOKEN INFO memory | grep used_memory_human

# Count keys by prefix
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
  -p 6379 --tls -a $REDIS_AUTH_TOKEN \
  --scan --pattern 'sf:cache:*' | wc -l

redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
  -p 6379 --tls -a $REDIS_AUTH_TOKEN \
  --scan --pattern 'sf:session:*' | wc -l

# Check hit/miss ratio
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
  -p 6379 --tls -a $REDIS_AUTH_TOKEN INFO stats | grep -E 'keyspace_hits|keyspace_misses'