Graph RAG Troubleshooting
⚙️ Minimum Requirements
- CloudWatch Logs: Log groups /aws/lambda/StackFlowAPI, /aws/lambda/StackFlowCacheWarmer, and /aws/lambda/StackFlowNeptuneCMDBSeeder with 30-day retention
- CloudWatch Insights: Query capability on all StackFlow Lambda log groups for RAG diagnostics
- DynamoDB: stackflow-ai-audit-log table with error attribute indexed for failure queries
- AWS CLI: v2.x configured with credentials for account 373544523367, region us-east-1
- VPC Access: Diagnostic Python scripts require bastion host or VPN to reach Neptune (port 8182) and Redis (port 6379)
Common Issues
| Symptom | Likely Cause | Diagnostic Steps | Resolution |
|---|---|---|---|
| Neptune Gremlin query returns empty results | Graph not populated or wrong vertex label | Run g.V().count() to check total vertices. Check when StackFlowNeptuneCMDBSeeder last ran. | Trigger seeder (the CLI requires an output file argument): aws lambda invoke --function-name StackFlowNeptuneCMDBSeeder /tmp/seeder.json |
| Neptune connection timeout from Lambda | Security group or IAM auth not configured | Check sg-0ada825cda6a75ed6 allows port 8182. Verify IAM auth enabled on cluster. Check Lambda VPC config. | Add inbound rule for port 8182 from Lambda SG. Verify StackFlowNeptuneRole is attached. |
| Bedrock KB returning zero results | No documents ingested or ingestion job failed | Run aws bedrock-agent list-ingestion-jobs --knowledge-base-id BXJGG7PIPS. Check S3 bucket for documents. | Upload documents to s3://stackflow-kb-documents-373544523367/ and trigger new ingestion job. |
| Bedrock KB retrieve_and_generate throttling (429) | Service quota exceeded | Check CloudWatch metric StackFlow/AI:ModelThrottleCount. Check Bedrock service quotas in console. | Request quota increase via AWS Service Quotas. Implement exponential backoff with 5 retries. Redis cache absorbs repeated queries. |
| Vector embeddings stale / not reflecting new documents | Ingestion job not triggered after S3 upload | Check aws bedrock-agent list-ingestion-jobs for recent jobs. Verify S3 event trigger on the KB documents bucket. | Manually trigger ingestion job. Set up EventBridge S3 notification to auto-trigger StackFlowKBIngestor Lambda. |
| Redis AUTH failure from Lambda | Auth token rotated or env var stale | Check Secrets Manager stackflow/redis/auth-token for last rotation date. Compare with Lambda env var. | Redeploy StackFlowAPI Lambda to pick up new secret value. Verify StackFlowAPIRole has secretsmanager:GetSecretValue. |
| Redis connection timeout (TLS handshake) | Security group blocking port 6379 TLS | Check sg-0ada825cda6a75ed6 outbound rules. Verify Redis SG allows inbound 6379 from Lambda SG. | Add SG inbound rule: TCP 6379 from Lambda security group sg-0ada825cda6a75ed6. |
| Semantic cache not warming (CacheWarmer Lambda fails) | Aurora ai_outcomes table empty or Redis unreachable | Check /aws/lambda/StackFlowCacheWarmer logs. Verify Aurora has ai_outcomes rows from last 30 days. | If Aurora table is empty, disable cache warmer and rely on on-demand caching. Fix Aurora connectivity if Lambda can't reach DB. |
| Cache hit rate below 10% | Query text not being normalized before hashing | Log the normalized query text before SHA256. Check if trailing spaces, capitalisation, or punctuation vary between calls. | Enforce normalization: query.toLowerCase().trim().replace(/\s+/g, ' ') before computing cache key. |
| AI answer is confidently wrong (hallucination) | KB retrieval returning irrelevant passages | Test retrieval directly with bedrock-agent-runtime:Retrieve. Check passage scores -- if top score <0.4, retrieval quality is poor. | Add more specific runbooks to S3 KB bucket. Reduce temperature to 0.1 in prompt template. Add ground-truth exemplars to StackFlow_AIExemplar. |
| OBO token exchange 401 from Azure | Client secret expired or tenant ID mismatch | Check Secrets Manager stackflow/azure-sso/client-secret expiry. Verify AZURE_TENANT_ID matches df4d171f-6cca-4c87-84cd-f299e4fca3a9. | Rotate the Azure app client secret in Azure portal and update Secrets Manager. Redeploy stackflow-dev-obo-token-exchange Lambda. |
| Model router using wrong/expensive model | Intent misclassified; router rule priority conflict | Enable AI audit log and check intentType for misrouted queries. Review StackFlow_AIModelRouter for overlapping rules. | Add more specific routing rules with higher priority for the affected intent type. Check the intent classification prompt template. |
| Prompt template not found (PromptTemplateNotFoundError) | Template ID missing from StackFlow_PromptTemplate DynamoDB | Log the exact templateId from CloudWatch. Query DynamoDB: aws dynamodb get-item --table-name StackFlow_PromptTemplate --key '{"templateId":{"S":"..."}}' | Create the missing template in StackFlow_PromptTemplate. Required templates: incident-triage-v1, kb-rag-answer-v1, change-risk-assessment-v1. |
| Procedural memory not updating after incident resolution | Post-resolution workflow step disabled or Lambda error | Check StackFlow_Workflow for memory_update step. Verify feature flag ai_learning is enabled. Check Lambda logs for DynamoDB write errors. | Enable ai_learning flag. Verify StackFlowAPIRole has dynamodb:PutItem and dynamodb:UpdateItem on StackFlow_ProceduralMemory. |
| Pattern cluster not matching similar incidents | Cluster model stale or too few exemplars per cluster | Check StackFlow_PatternCluster for last updatedAt timestamp. Verify StackFlowPatternClusterer Lambda ran recently. | Trigger StackFlowPatternClusterer manually. Add more labelled exemplars to StackFlow_AIExemplar for underrepresented categories. |
| Exemplar retrieval returning irrelevant examples | Exemplar qualityScore threshold too low or wrong intentType | Query StackFlow_AIExemplar GSI intentType-index for the affected intent. Check qualityScore distribution. | Raise minimum qualityScore filter to 0.7. Remove exemplars with qualityScore <0.5. Add more high-quality exemplars for the intent type. |
| AI audit log not recording interactions | DynamoDB PutItem permission missing or table TTL misconfigured | Check CloudWatch for DynamoDB errors in StackFlowAPI logs. Verify table stackflow-ai-audit-log exists and TTL attribute expiresAt is enabled. | Add dynamodb:PutItem on stackflow-ai-audit-log to StackFlowAPIRole. Create table if missing. |
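| OpenSearch Serverless collection returning 403 | Data access policy missing the IAM role | Check OpenSearch Serverless data access policies in AWS Console. Verify StackFlowBedrockKBRole and StackFlowAPIRole are listed as principals. | Update the data access policy via aws opensearchserverless update-access-policy so both roles are principals with the collection and index permissions they need. Verify each role's IAM policy also allows aoss:APIAccessAll, which is an IAM action and not a data access policy permission. |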
| Ingestion job stuck / never completing | S3 access denied or OpenSearch OCU capacity limit | Check ingestion job status and failedDocumentCount. Review CloudWatch Logs for Bedrock ingestion errors. Check OpenSearch OCU usage. | Fix S3 bucket policy for StackFlowBedrockKBRole. Increase OpenSearch Serverless OCU limit in console. Remove malformed documents from S3. |
| AI response latency > 8 seconds | Cold start, Neptune slow query, or Bedrock KB high latency | Enable X-Ray on StackFlowAPI. Check Neptune GremlinRequestsPerSec and CPUUtilization. Check Bedrock InvocationLatency metric. | Enable Provisioned Concurrency on StackFlowAPI alias. Optimize Gremlin queries with .limit() and .dedup(). Reduce KB numberOfResults from 10 to 5. |
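The cache hit rate row above recommends normalizing the query text before hashing it into a cache key. The following is a minimal Python sketch of that idea, mirroring the lowercase, trim, and whitespace-collapse steps from the table. The function names and the "rag:" key prefix are hypothetical and not taken from the StackFlow code base.

```python
import hashlib
import re


def normalize_query(query: str) -> str:
    """Lowercase, trim, and collapse every whitespace run to one space."""
    return re.sub(r"\s+", " ", query.strip().lower())


def cache_key(query: str, prefix: str = "rag:") -> str:
    """Return prefix + SHA-256 hex digest of the normalized query.

    The "rag:" prefix is an illustrative assumption, not a known convention.
    """
    digest = hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()
    return prefix + digest


if __name__ == "__main__":
    a = cache_key("  How do I   reset a Cognito user password? ")
    b = cache_key("how do i reset a cognito user password?")
    # Both variants should map to the same key after normalization.
    print(a == b)
```

Note that this normalization leaves punctuation untouched, so "reset password?" and "reset password" still hash to different keys. If punctuation varies between calls, an extra stripping step would be needed.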
Diagnostic CLI Commands
# Check Neptune cluster status
aws neptune describe-db-clusters --db-cluster-identifier stackflow-knowledge-graph --query 'DBClusters[0].{Status:Status,Engine:EngineVersion,MultiAZ:MultiAZ}' --region us-east-1
# Check Bedrock KB status
aws bedrock-agent get-knowledge-base --knowledge-base-id BXJGG7PIPS --query 'knowledgeBase.{Status:status,Updated:updatedAt,Name:name}' --region us-east-1
# List ingestion jobs (last 5)
aws bedrock-agent list-ingestion-jobs --knowledge-base-id BXJGG7PIPS --data-source-id $(aws bedrock-agent list-data-sources --knowledge-base-id BXJGG7PIPS --query 'dataSourceSummaries[0].dataSourceId' --output text) --query 'ingestionJobSummaries[0:5].{Status:status,Started:startedAt,Stats:statistics}' --region us-east-1
# Check CacheWarmer Lambda last run
aws logs filter-log-events --log-group-name /aws/lambda/StackFlowCacheWarmer --start-time $(date -d '24 hours ago' +%s000) --filter-pattern '"warmed"' --query 'events[*].message' --output text --region us-east-1
# Test StackFlowAPI health (includes Redis and Aurora checks)
aws lambda invoke --function-name StackFlowAPI --cli-binary-format raw-in-base64-out --payload '{"path":"/prod/api/health","httpMethod":"GET","headers":{}}' --region us-east-1 /tmp/health.json && cat /tmp/health.json
# Check Neptune CPU utilization (last 1 hour)
aws cloudwatch get-metric-statistics --namespace AWS/Neptune --metric-name CPUUtilization --dimensions Name=DBClusterIdentifier,Value=stackflow-knowledge-graph --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 300 --statistics Average --region us-east-1
# Check Redis cache hit ratio in ElastiCache metrics
aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --metric-name CacheHitRate --dimensions Name=CacheClusterId,Value=stackflow-redis-prod-0001-001 --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 300 --statistics Average --region us-east-1
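# Check AI audit log for recent errors
# ("query" is a DynamoDB reserved word, so it is aliased as #q; --max-items caps returned matches, whereas --limit caps items scanned before the filter is applied)
aws dynamodb scan --table-name stackflow-ai-audit-log --filter-expression 'attribute_exists(#e)' --expression-attribute-names '{"#e":"error","#q":"query"}' --projection-expression 'tenantId,userId,#q,#e,latencyMs' --max-items 20 --region us-east-1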
# Verify OpenSearch Serverless collection is active
aws opensearchserverless batch-get-collection --ids q3oso7unldm9p4xsqez4 --query 'collectionDetails[0].{Status:status,Endpoint:collectionEndpoint}' --region us-east-1
# Check Neptune graph vertex count (requires VPC access)
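# This request is unsigned and is rejected if IAM authentication is enabled on the cluster; in that case use the signed script under Neptune Connectivity Test.
# curl -s -X POST https://stackflow-knowledge-graph.cluster-c6pq0smgmlri.us-east-1.neptune.amazonaws.com:8182/gremlin \
#   -H "Content-Type: application/json" \
#   -d '{"gremlin":"g.V().count()"}' | python3 -m json.tool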
Log Queries (CloudWatch Insights)
# AI audit log: find all failed queries in last 24 hours (set the time range to 24h in the query window)
fields @timestamp, query, error, latencyMs, tenantId
| filter ispresent(error)
| sort @timestamp desc
| limit 50
# Lambda: RAG-related errors only
fields @timestamp, @message
| filter @message like /Neptune|Bedrock|Redis|RAG|graph-rag/
| filter @message like /ERROR|WARN|timeout|ECONNREFUSED/
| sort @timestamp desc
| limit 100
# Lambda: slow AI responses (> 5 seconds)
fields @timestamp, @message
| filter @message like /latencyMs/
| parse @message '"latencyMs":*,' as latencyMs
| filter latencyMs > 5000
| sort latencyMs desc
| limit 20
# Neptune seeder: check sync results
fields @timestamp, @message
| filter @message like /vertices_upserted|edges_upserted|ERROR/
| sort @timestamp desc
| limit 30
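When log events have been exported (for example with aws logs filter-log-events), the slow-response query above can be approximated offline. The sketch below assumes each log line contains a JSON fragment of the form "latencyMs": <number>, which is what the parse pattern in the Insights query implies. The function name is hypothetical.

```python
import re

# Matches "latencyMs": 1234 or "latencyMs":1234.5 inside a log line.
LATENCY_RE = re.compile(r'"latencyMs"\s*:\s*(\d+(?:\.\d+)?)')


def slow_entries(lines, threshold_ms=5000):
    """Return (latency_ms, line) pairs above the threshold, slowest first."""
    hits = []
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            latency = float(match.group(1))
            if latency > threshold_ms:
                hits.append((latency, line))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


if __name__ == "__main__":
    sample = [
        '{"latencyMs":6200,"tenantId":"t1"}',
        '{"latencyMs":4100,"tenantId":"t2"}',
        'INFO no latency field here',
    ]
    for latency, line in slow_entries(sample):
        print(latency, line)
```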
Neptune Connectivity Test
#!/usr/bin/env python3
"""
Neptune connectivity test -- run from a Lambda or EC2 instance in the VPC.
Requires: boto3, requests, aws-requests-auth
    pip install boto3 requests aws-requests-auth
"""
import boto3
import requests
from aws_requests_auth.boto_utils import BotoAWSRequestsAuth

NEPTUNE_ENDPOINT = 'stackflow-knowledge-graph.cluster-c6pq0smgmlri.us-east-1.neptune.amazonaws.com'
REGION = 'us-east-1'

# Sign each request with SigV4 using the credentials boto resolves by default
auth = BotoAWSRequestsAuth(
    aws_host=f'{NEPTUNE_ENDPOINT}:8182',
    aws_region=REGION,
    aws_service='neptune-db'
)

# Test 1: Count total vertices
url = f'https://{NEPTUNE_ENDPOINT}:8182/gremlin'
response = requests.post(url, json={'gremlin': 'g.V().count()'}, auth=auth, timeout=30)
print(f'Status: {response.status_code}')
response.raise_for_status()  # stop here on 4xx/5xx instead of failing on JSON parsing
result = response.json()
count = result['result']['data']['@value'][0]['@value']
print(f'Total vertices in graph: {count}')

# Test 2: List vertex labels
response2 = requests.post(url, json={'gremlin': 'g.V().label().groupCount()'}, auth=auth, timeout=30)
response2.raise_for_status()
# GraphSON encodes the map as a flat [label, count, label, count, ...] list
labels = response2.json()['result']['data']['@value'][0]['@value']
print('Vertex labels:', labels)
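The label listing in Test 2 prints the raw GraphSON structure, where a map arrives as a flat list of alternating keys and typed values. The sketch below shows one way to turn such a response into a plain Python dict. It assumes the endpoint returns GraphSON v3, as the indexing in the script above suggests. The helper name and the sample labels are hypothetical.

```python
def unwrap(value):
    """Recursively strip GraphSON v3 {"@type": ..., "@value": ...} wrappers.

    g:Map values are flat [key, value, key, value, ...] lists and become dicts.
    Plain dicts (such as the body of a g:Vertex) are returned unchanged.
    """
    if isinstance(value, dict) and "@type" in value and "@value" in value:
        inner = value["@value"]
        if value["@type"] == "g:Map":
            items = [unwrap(item) for item in inner]
            return dict(zip(items[0::2], items[1::2]))
        return unwrap(inner)
    if isinstance(value, list):
        return [unwrap(item) for item in value]
    return value


if __name__ == "__main__":
    # Hypothetical payload shaped like result['data'] for g.V().label().groupCount()
    sample = {
        "@type": "g:List",
        "@value": [{
            "@type": "g:Map",
            "@value": [
                "Server", {"@type": "g:Int64", "@value": 12},
                "Application", {"@type": "g:Int64", "@value": 5},
            ],
        }],
    }
    print(unwrap(sample)[0])
```

In the connectivity script this would be applied as unwrap(response2.json()['result']['data'])[0], giving a label-to-count dict that is easier to compare against the labels the Gremlin queries expect.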
Bedrock KB Smoke Test
#!/usr/bin/env python3
"""Quick smoke test for Bedrock KB BXJGG7PIPS"""
import boto3
import json

bedrock = boto3.client('bedrock-agent-runtime', region_name='us-east-1')


def smoke_test():
    test_queries = [
        "How do I resolve high Aurora connection count?",
        "What is the Neptune cluster endpoint?",
        "How do I reset a Cognito user password?"
    ]
    for q in test_queries:
        print(f"\nQuery: {q}")
        try:
            response = bedrock.retrieve(
                knowledgeBaseId='BXJGG7PIPS',
                retrievalQuery={'text': q},
                retrievalConfiguration={
                    'vectorSearchConfiguration': {'numberOfResults': 3}
                }
            )
            results = response['retrievalResults']
            print(f" Results: {len(results)} passages")
            if results:
                top = results[0]
                print(f" Top score: {top['score']:.3f}")
                print(f" Source: {top['location'].get('s3Location', {}).get('uri', 'N/A')}")
                print(f" Preview: {top['content']['text'][:100]}...")
            else:
                print(" WARNING: No results returned -- KB may be empty")
        except Exception as e:
            print(f" ERROR: {e}")


smoke_test()
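The Common Issues table recommends exponential backoff with 5 retries for Bedrock throttling and treats a top passage score below 0.4 as poor retrieval. The sketch below illustrates both checks around a retrieve-style call. It assumes throttling surfaces as a botocore-style exception whose response["Error"]["Code"] is "ThrottlingException". All function names, delay values, and the jitter strategy are illustrative assumptions, not the StackFlow implementation.

```python
import random
import time

# Assumed error code for throttled Bedrock calls (botocore ClientError style).
THROTTLE_CODES = {"ThrottlingException"}


def is_throttle(exc):
    """True if exc carries a botocore-style response with a throttling code."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call fn(); on throttling, retry up to max_retries times.

    The delay ceiling doubles per attempt (capped at max_delay) and the actual
    wait is drawn uniformly below it. Other errors propagate immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if not is_throttle(exc) or attempt == max_retries:
                raise
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))


def retrieval_quality(results, threshold=0.4):
    """Classify retrieval results as 'empty', 'poor', or 'ok' by top score."""
    if not results:
        return "empty"
    top_score = max(r.get("score", 0.0) for r in results)
    return "ok" if top_score >= threshold else "poor"


if __name__ == "__main__":
    # Stand-in for bedrock.retrieve(...)['retrievalResults'] so this runs offline.
    fake_results = call_with_backoff(lambda: [{"score": 0.31}, {"score": 0.22}],
                                     sleep=lambda seconds: None)
    print(retrieval_quality(fake_results))
```

In the smoke test, the wrapped call would look like call_with_backoff(lambda: bedrock.retrieve(...)). boto3 clients also retry some throttling errors on their own, configurable through botocore's Config(retries=...), so an explicit wrapper is mainly useful when the retry count and delays need to be controlled in application code.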