v2026.1

Graph RAG Troubleshooting

⚙️ Minimum Requirements
  • CloudWatch Logs: Log groups /aws/lambda/StackFlowAPI, /aws/lambda/StackFlowCacheWarmer, /aws/lambda/StackFlowNeptuneCMDBSeeder with 30-day retention
  • CloudWatch Insights: Query capability on all StackFlow Lambda log groups for RAG diagnostics
  • DynamoDB: stackflow-ai-audit-log table with error attribute indexed for failure queries
  • AWS CLI: v2.x configured with credentials for account 373544523367, region us-east-1
  • VPC Access: Diagnostic Python scripts require bastion host or VPN to reach Neptune (port 8182) and Redis (port 6379)

Common Issues

Symptom | Likely Cause | Diagnostic Steps | Resolution
--- | --- | --- | ---
Neptune Gremlin query returns empty results | Graph not populated or wrong vertex label | Run g.V().count() to check total vertices. Check StackFlowNeptuneCMDBSeeder last run. | Trigger seeder: aws lambda invoke --function-name StackFlowNeptuneCMDBSeeder /tmp/seeder.json
Neptune connection timeout from Lambda | Security group or IAM auth not configured | Check sg-0ada825cda6a75ed6 allows port 8182. Verify IAM auth enabled on cluster. Check Lambda VPC config. | Add inbound rule for port 8182 from Lambda SG. Verify StackFlowNeptuneRole is attached.
Bedrock KB returning zero results | No documents ingested or ingestion job failed | Run aws bedrock-agent list-ingestion-jobs --knowledge-base-id BXJGG7PIPS. Check S3 bucket for documents. | Upload documents to s3://stackflow-kb-documents-373544523367/ and trigger new ingestion job.
Bedrock KB retrieve_and_generate throttling (429) | Service quota exceeded | Check CloudWatch metric StackFlow/AI:ModelThrottleCount. Check Bedrock service quotas in console. | Request quota increase via AWS Service Quotas. Implement exponential backoff with 5 retries. Redis cache absorbs repeated queries.
Vector embeddings stale / not reflecting new documents | Ingestion job not triggered after S3 upload | Check aws bedrock-agent list-ingestion-jobs for recent jobs. Verify S3 event trigger on the KB documents bucket. | Manually trigger ingestion job. Set up EventBridge S3 notification to auto-trigger StackFlowKBIngestor Lambda.
Redis AUTH failure from Lambda | Auth token rotated or env var stale | Check Secrets Manager stackflow/redis/auth-token for last rotation date. Compare with Lambda env var. | Redeploy StackFlowAPI Lambda to pick up new secret value. Verify StackFlowAPIRole has secretsmanager:GetSecretValue.
Redis connection timeout (TLS handshake) | Security group blocking port 6379 TLS | Check sg-0ada825cda6a75ed6 outbound rules. Verify Redis SG allows inbound 6379 from Lambda SG. | Add SG inbound rule: TCP 6379 from Lambda security group sg-0ada825cda6a75ed6.
Semantic cache not warming (CacheWarmer Lambda fails) | Aurora ai_outcomes table empty or Redis unreachable | Check /aws/lambda/StackFlowCacheWarmer logs. Verify Aurora has ai_outcomes rows from last 30 days. | If Aurora table is empty, disable cache warmer and rely on on-demand caching. Fix Aurora connectivity if Lambda can't reach DB.
Cache hit rate below 10% | Query text not being normalized before hashing | Log the normalized query text before SHA256. Check if trailing spaces, capitalisation, or punctuation vary between calls. | Enforce normalization: query.toLowerCase().trim().replace(/\s+/g, ' ') before computing cache key.
AI answer is confidently wrong (hallucination) | KB retrieval returning irrelevant passages | Test retrieval directly with bedrock-agent-runtime:Retrieve. Check passage scores -- if top score <0.4, retrieval quality is poor. | Add more specific runbooks to S3 KB bucket. Reduce temperature to 0.1 in prompt template. Add ground-truth exemplars to StackFlow_AIExemplar.
OBO token exchange 401 from Azure | Client secret expired or tenant ID mismatch | Check Secrets Manager stackflow/azure-sso/client-secret expiry. Verify AZURE_TENANT_ID matches df4d171f-6cca-4c87-84cd-f299e4fca3a9. | Rotate the Azure app client secret in Azure portal and update Secrets Manager. Redeploy stackflow-dev-obo-token-exchange Lambda.
Model router using wrong/expensive model | Intent misclassified; router rule priority conflict | Enable AI audit log and check intentType for misrouted queries. Review StackFlow_AIModelRouter for overlapping rules. | Add more specific routing rules with higher priority for the affected intent type. Check the intent classification prompt template.
Prompt template not found (PromptTemplateNotFoundError) | Template ID missing from StackFlow_PromptTemplate DynamoDB | Log the exact templateId from CloudWatch. Query DynamoDB: aws dynamodb get-item --table-name StackFlow_PromptTemplate --key '{"templateId":{"S":"..."}}' | Create the missing template in StackFlow_PromptTemplate. Required templates: incident-triage-v1, kb-rag-answer-v1, change-risk-assessment-v1.
Procedural memory not updating after incident resolution | Post-resolution workflow step disabled or Lambda error | Check StackFlow_Workflow for memory_update step. Verify feature flag ai_learning is enabled. Check Lambda logs for DynamoDB write errors. | Enable ai_learning flag. Verify StackFlowAPIRole has dynamodb:PutItem and dynamodb:UpdateItem on StackFlow_ProceduralMemory.
Pattern cluster not matching similar incidents | Cluster model stale or too few exemplars per cluster | Check StackFlow_PatternCluster for last updatedAt timestamp. Verify StackFlowPatternClusterer Lambda ran recently. | Trigger StackFlowPatternClusterer manually. Add more labelled exemplars to StackFlow_AIExemplar for underrepresented categories.
Exemplar retrieval returning irrelevant examples | Exemplar qualityScore threshold too low or wrong intentType | Query StackFlow_AIExemplar GSI intentType-index for the affected intent. Check qualityScore distribution. | Raise minimum qualityScore filter to 0.7. Remove exemplars with qualityScore <0.5. Add more high-quality exemplars for the intent type.
AI audit log not recording interactions | DynamoDB PutItem permission missing or table TTL misconfigured | Check CloudWatch for DynamoDB errors in StackFlowAPI logs. Verify table stackflow-ai-audit-log exists and TTL attribute expiresAt is enabled. | Add dynamodb:PutItem on stackflow-ai-audit-log to StackFlowAPIRole. Create table if missing.
OpenSearch Serverless collection returning 403 | Data access policy missing the IAM role | Check OpenSearch Serverless data access policies in AWS Console. Verify StackFlowBedrockKBRole and StackFlowAPIRole are listed. | Update data access policy via aws opensearchserverless update-access-policy to include both roles with aoss:APIAccessAll.
Ingestion job stuck / never completing | S3 access denied or OpenSearch OCU capacity limit | Check ingestion job status and failedDocumentCount. Review CloudWatch Logs for Bedrock ingestion errors. Check OpenSearch OCU usage. | Fix S3 bucket policy for StackFlowBedrockKBRole. Increase OpenSearch Serverless OCU limit in console. Remove malformed documents from S3.
AI response latency > 8 seconds | Cold start, Neptune slow query, or Bedrock KB high latency | Enable X-Ray on StackFlowAPI. Check Neptune GremlinRequestsPerSec and CPUUtilization. Check Bedrock InvocationLatency metric. | Enable Provisioned Concurrency on StackFlowAPI alias. Optimize Gremlin queries with .limit() and .dedup(). Reduce KB numberOfResults from 10 to 5.
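
The throttling row above recommends exponential backoff with 5 retries. A minimal sketch of that pattern follows, assuming the caller wraps the Bedrock call in a zero-argument function. ThrottledError and call_with_backoff are hypothetical names used for illustration only, and the delay values are assumptions rather than StackFlow settings.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling (HTTP 429) error."""


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(); on ThrottledError, wait and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the throttle to the caller
            # Full jitter: random delay in [0, min(max_delay, base_delay * 2^attempt)]
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
```

In real use, fn would catch the SDK's throttling exception and re-raise it as ThrottledError. The sleep and rng parameters exist only so the delays can be checked without waiting.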

Diagnostic CLI Commands

# Check Neptune cluster status
aws neptune describe-db-clusters \
  --db-cluster-identifier stackflow-knowledge-graph \
  --query 'DBClusters[0].{Status:Status,Engine:EngineVersion,MultiAZ:MultiAZ}' \
  --region us-east-1

# Check Bedrock KB status
aws bedrock-agent get-knowledge-base \
  --knowledge-base-id BXJGG7PIPS \
  --query 'knowledgeBase.{Status:status,Updated:updatedAt,Name:name}' \
  --region us-east-1

# List ingestion jobs (last 5)
aws bedrock-agent list-ingestion-jobs \
  --knowledge-base-id BXJGG7PIPS \
  --data-source-id "$(aws bedrock-agent list-data-sources --knowledge-base-id BXJGG7PIPS --query 'dataSourceSummaries[0].dataSourceId' --output text)" \
  --query 'ingestionJobSummaries[0:5].{Status:status,Started:startedAt,Stats:statistics}' \
  --region us-east-1

# Check CacheWarmer Lambda last run
aws logs filter-log-events \
  --log-group-name /aws/lambda/StackFlowCacheWarmer \
  --start-time $(date -d '24 hours ago' +%s000) \
  --filter-pattern '"warmed"' \
  --query 'events[*].message' --output text \
  --region us-east-1

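# Test StackFlowAPI health (includes Redis and Aurora checks)
# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload
aws lambda invoke \
  --function-name StackFlowAPI \
  --cli-binary-format raw-in-base64-out \
  --payload '{"path":"/prod/api/health","httpMethod":"GET","headers":{}}' \
  --region us-east-1 \
  /tmp/health.json && cat /tmp/health.json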

# Check Neptune CPU utilization (last 1 hour)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Neptune \
  --metric-name CPUUtilization \
  --dimensions Name=DBClusterIdentifier,Value=stackflow-knowledge-graph \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Average \
  --region us-east-1

# Check Redis cache hit ratio in ElastiCache metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/ElastiCache \
  --metric-name CacheHitRate \
  --dimensions Name=CacheClusterId,Value=stackflow-redis-prod-0001-001 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Average \
  --region us-east-1

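# Check AI audit log for recent errors
# "query" is aliased as #q to avoid a DynamoDB reserved-word clash.
# --max-items caps the items returned; --limit would cap the items evaluated
# before the filter runs, so it could return no errors at all.
aws dynamodb scan \
  --table-name stackflow-ai-audit-log \
  --filter-expression 'attribute_exists(#e)' \
  --expression-attribute-names '{"#e":"error","#q":"query"}' \
  --projection-expression 'tenantId,userId,#q,#e,latencyMs' \
  --max-items 20 \
  --region us-east-1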

# Verify OpenSearch Serverless collection is active
aws opensearchserverless batch-get-collection \
  --ids q3oso7unldm9p4xsqez4 \
  --query 'collectionDetails[0].{Status:status,Endpoint:collectionEndpoint}' \
  --region us-east-1

# Check Neptune graph vertex count (requires VPC access)
# curl -s -X POST https://stackflow-knowledge-graph.cluster-c6pq0smgmlri.us-east-1.neptune.amazonaws.com:8182/gremlin \
#   -H "Content-Type: application/json" \
#   -d '{"gremlin":"g.V().count()"}' | python3 -m json.tool

Log Queries (CloudWatch Insights)

# AI audit log: find all failed queries (set the query time range to the last 24 hours)
fields @timestamp, query, error, latencyMs, tenantId
| filter ispresent(error)
| sort @timestamp desc
| limit 50

# Lambda: RAG-related errors only
fields @timestamp, @message
| filter @message like /Neptune|Bedrock|Redis|RAG|graph-rag/
| filter @message like /ERROR|WARN|timeout|ECONNREFUSED/
| sort @timestamp desc
| limit 100

# Lambda: slow AI responses (> 5 seconds)
fields @timestamp, @message
| filter @message like /latencyMs/
| parse @message '"latencyMs":*,' as latencyMs
| filter latencyMs > 5000
| sort latencyMs desc
| limit 20

# Neptune seeder: check sync results
fields @timestamp, @message
| filter @message like /vertices_upserted|edges_upserted|ERROR/
| sort @timestamp desc
| limit 30
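
The slow-response query can also be approximated offline against exported log lines, which helps when checking whether the parse pattern matches the log format at all. The sketch below assumes each message contains a "latencyMs":<number> pair, as the Insights parse step implies. slow_responses is a hypothetical helper, not part of StackFlow.

```python
import re

# Matches "latencyMs":1234 or "latencyMs": 1234.5 inside a log message
LATENCY_RE = re.compile(r'"latencyMs"\s*:\s*(\d+(?:\.\d+)?)')


def slow_responses(messages, threshold_ms=5000, limit=20):
    """Return up to `limit` (latency_ms, message) pairs above the threshold, slowest first."""
    hits = []
    for msg in messages:
        match = LATENCY_RE.search(msg)
        if match is None:
            continue  # message carries no latency field
        latency = float(match.group(1))
        if latency > threshold_ms:
            hits.append((latency, msg))
    # Compare as numbers, not strings, so 10000 sorts above 9000
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:limit]
```

If this returns nothing for lines that are known to be slow, the log format probably differs from the pattern the query assumes.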

Neptune Connectivity Test

#!/usr/bin/env python3
"""
Neptune connectivity test -- run from a Lambda or EC2 instance in the VPC.
Requires: boto3, requests, aws-requests-auth
pip install boto3 requests aws-requests-auth
"""
import boto3
import requests
from aws_requests_auth.boto_utils import BotoAWSRequestsAuth

NEPTUNE_ENDPOINT = 'stackflow-knowledge-graph.cluster-c6pq0smgmlri.us-east-1.neptune.amazonaws.com'
REGION = 'us-east-1'

auth = BotoAWSRequestsAuth(
    aws_host=f'{NEPTUNE_ENDPOINT}:8182',
    aws_region=REGION,
    aws_service='neptune-db'
)

# Test 1: Count total vertices
url = f'https://{NEPTUNE_ENDPOINT}:8182/gremlin'
response = requests.post(url, json={'gremlin': 'g.V().count()'}, auth=auth, timeout=30)
print(f'Status: {response.status_code}')
response.raise_for_status()  # fail clearly on 403 (IAM auth) or 5xx instead of a KeyError below
result = response.json()
count = result['result']['data']['@value'][0]['@value']
print(f'Total vertices in graph: {count}')

# Test 2: List vertex labels
response2 = requests.post(url, json={'gremlin': 'g.V().label().groupCount()'}, auth=auth, timeout=30)
response2.raise_for_status()
# With GraphSON 3.0 the map arrives as a flat [label, count, label, count, ...] list
labels = response2.json()['result']['data']['@value'][0]['@value']
print('Vertex labels:', labels)
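
The label counts printed by Test 2 are still GraphSON-encoded, which makes them hard to read. The sketch below converts such a payload to plain Python values, assuming the endpoint returns GraphSON 3.0, where a g:Map is serialized as a flat list of alternating keys and values. unwrap is a hypothetical helper and only handles the types this test produces.

```python
def unwrap(node):
    """Recursively convert GraphSON 3.0 typed values to plain Python objects."""
    if isinstance(node, dict) and '@type' in node and '@value' in node:
        kind, value = node['@type'], node['@value']
        if kind == 'g:Map':
            # g:Map is a flat list: [key1, value1, key2, value2, ...]
            items = [unwrap(v) for v in value]
            return dict(zip(items[0::2], items[1::2]))
        if kind in ('g:List', 'g:Set'):
            return [unwrap(v) for v in value]
        return unwrap(value)  # scalar wrappers such as g:Int64
    if isinstance(node, dict):
        return {k: unwrap(v) for k, v in node.items()}
    if isinstance(node, list):
        return [unwrap(v) for v in node]
    return node
```

For example, unwrap(response2.json()['result']['data']) would yield a one-element list holding a label-to-count dictionary.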

Bedrock KB Smoke Test

#!/usr/bin/env python3
"""Quick smoke test for Bedrock KB BXJGG7PIPS"""
import boto3
import json

bedrock = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

def smoke_test():
    test_queries = [
        "How do I resolve high Aurora connection count?",
        "What is the Neptune cluster endpoint?",
        "How do I reset a Cognito user password?"
    ]
    for q in test_queries:
        print(f"\nQuery: {q}")
        try:
            response = bedrock.retrieve(
                knowledgeBaseId='BXJGG7PIPS',
                retrievalQuery={'text': q},
                retrievalConfiguration={
                    'vectorSearchConfiguration': {'numberOfResults': 3}
                }
            )
            results = response['retrievalResults']
            print(f"  Results: {len(results)} passages")
            if results:
                top = results[0]
                print(f"  Top score: {top['score']:.3f}")
                print(f"  Source: {top['location'].get('s3Location', {}).get('uri', 'N/A')}")
                print(f"  Preview: {top['content']['text'][:100]}...")
            else:
                print("  WARNING: No results returned -- KB may be empty")
        except Exception as e:
            print(f"  ERROR: {e}")

smoke_test()
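
The smoke test prints the top score but leaves the interpretation to the reader. The sketch below applies the 0.4 threshold from the Common Issues table to a list of retrieval results. assess_retrieval and its 'empty' / 'poor' / 'ok' labels are hypothetical, and the code assumes each result carries a numeric score field, as the smoke test does.

```python
# From the Common Issues table: a top score below 0.4 indicates poor retrieval
POOR_SCORE_THRESHOLD = 0.4


def assess_retrieval(results, threshold=POOR_SCORE_THRESHOLD):
    """Classify a list of retrieval results as 'empty', 'poor', or 'ok'."""
    if not results:
        return 'empty'  # nothing retrieved; the KB may have no ingested documents
    # Use the best score rather than assuming the list is sorted
    top_score = max(r.get('score', 0.0) for r in results)
    return 'poor' if top_score < threshold else 'ok'
```

Inside smoke_test, this could be called as assess_retrieval(response['retrievalResults']) to flag queries whose passages are unlikely to ground a correct answer.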