Graph RAG Troubleshooting

⚙️ Minimum Requirements

CloudWatch Logs: Log groups /aws/lambda/StackFlowAPI, /aws/lambda/StackFlowCacheWarmer, /aws/lambda/StackFlowNeptuneCMDBSeeder with 30-day retention
CloudWatch Insights: Query capability on all StackFlow Lambda log groups for RAG diagnostics
DynamoDB: stackflow-ai-audit-log table with error attribute indexed for failure queries
AWS CLI: v2.x configured with credentials for account 373544523367, region us-east-1
VPC Access: Diagnostic Python scripts require a bastion host or VPN to reach Neptune (port 8182) and Redis (port 6379)

Common Issues

Symptom	Likely Cause	Diagnostic Steps	Resolution
Neptune Gremlin query returns empty results	Graph not populated or wrong vertex label	Run `g.V().count()` to check total vertices. Check when StackFlowNeptuneCMDBSeeder last ran.	Trigger seeder: `aws lambda invoke --function-name StackFlowNeptuneCMDBSeeder /tmp/seeder.json`
Neptune connection timeout from Lambda	Security group or IAM auth not configured	Check sg-0ada825cda6a75ed6 allows port 8182. Verify IAM auth enabled on cluster. Check Lambda VPC config.	Add inbound rule for port 8182 from Lambda SG. Verify StackFlowNeptuneRole is attached.
Bedrock KB returning zero results	No documents ingested or ingestion job failed	Run `aws bedrock-agent list-ingestion-jobs --knowledge-base-id BXJGG7PIPS`. Check S3 bucket for documents.	Upload documents to s3://stackflow-kb-documents-373544523367/ and trigger new ingestion job.
Bedrock KB retrieve_and_generate throttling (429)	Service quota exceeded	Check CloudWatch metric `StackFlow/AI:ModelThrottleCount`. Check Bedrock service quotas in console.	Request quota increase via AWS Service Quotas. Implement exponential backoff with 5 retries. Redis cache absorbs repeated queries.
Vector embeddings stale / not reflecting new documents	Ingestion job not triggered after S3 upload	Check `aws bedrock-agent list-ingestion-jobs` for recent jobs. Verify S3 event trigger on the KB documents bucket.	Manually trigger ingestion job. Set up EventBridge S3 notification to auto-trigger StackFlowKBIngestor Lambda.
Redis AUTH failure from Lambda	Auth token rotated or env var stale	Check Secrets Manager `stackflow/redis/auth-token` for last rotation date. Compare with Lambda env var.	Redeploy StackFlowAPI Lambda to pick up new secret value. Verify StackFlowAPIRole has `secretsmanager:GetSecretValue`.
Redis connection timeout (TLS handshake)	Security group blocking port 6379 TLS	Check sg-0ada825cda6a75ed6 outbound rules. Verify Redis SG allows inbound 6379 from Lambda SG.	Add an inbound rule on the Redis SG: TCP 6379 from Lambda security group sg-0ada825cda6a75ed6.
Semantic cache not warming (CacheWarmer Lambda fails)	Aurora ai_outcomes table empty or Redis unreachable	Check /aws/lambda/StackFlowCacheWarmer logs. Verify Aurora has ai_outcomes rows from last 30 days.	If Aurora table is empty, disable cache warmer and rely on on-demand caching. Fix Aurora connectivity if Lambda can't reach DB.
Cache hit rate below 10%	Query text not being normalized before hashing	Log the normalized query text before SHA256. Check if trailing spaces, capitalisation, or punctuation vary between calls.	Enforce normalization: `query.toLowerCase().trim().replace(/\s+/g, ' ')` before computing cache key.
AI answer is confidently wrong (hallucination)	KB retrieval returning irrelevant passages	Test retrieval directly with `bedrock-agent-runtime:Retrieve`. Check passage scores -- if top score <0.4, retrieval quality is poor.	Add more specific runbooks to S3 KB bucket. Reduce temperature to 0.1 in prompt template. Add ground-truth exemplars to StackFlow_AIExemplar.
OBO token exchange 401 from Azure	Client secret expired or tenant ID mismatch	Check Secrets Manager `stackflow/azure-sso/client-secret` expiry. Verify AZURE_TENANT_ID matches df4d171f-6cca-4c87-84cd-f299e4fca3a9.	Rotate the Azure app client secret in Azure portal and update Secrets Manager. Redeploy `stackflow-dev-obo-token-exchange` Lambda.
Model router using wrong/expensive model	Intent misclassified; router rule priority conflict	Enable AI audit log and check `intentType` for misrouted queries. Review StackFlow_AIModelRouter for overlapping rules.	Add more specific routing rules with higher priority for the affected intent type. Check the intent classification prompt template.
Prompt template not found (PromptTemplateNotFoundError)	Template ID missing from StackFlow_PromptTemplate DynamoDB	Log the exact templateId from CloudWatch. Query DynamoDB: `aws dynamodb get-item --table-name StackFlow_PromptTemplate --key '{"templateId":{"S":"..."}}'`	Create the missing template in StackFlow_PromptTemplate. Required templates: incident-triage-v1, kb-rag-answer-v1, change-risk-assessment-v1.
Procedural memory not updating after incident resolution	Post-resolution workflow step disabled or Lambda error	Check StackFlow_Workflow for memory_update step. Verify feature flag `ai_learning` is enabled. Check Lambda logs for DynamoDB write errors.	Enable `ai_learning` flag. Verify StackFlowAPIRole has `dynamodb:PutItem` and `dynamodb:UpdateItem` on StackFlow_ProceduralMemory.
Pattern cluster not matching similar incidents	Cluster model stale or too few exemplars per cluster	Check StackFlow_PatternCluster for last `updatedAt` timestamp. Verify StackFlowPatternClusterer Lambda ran recently.	Trigger StackFlowPatternClusterer manually. Add more labelled exemplars to StackFlow_AIExemplar for underrepresented categories.
Exemplar retrieval returning irrelevant examples	Exemplar qualityScore threshold too low or wrong intentType	Query StackFlow_AIExemplar GSI `intentType-index` for the affected intent. Check qualityScore distribution.	Raise minimum qualityScore filter to 0.7. Remove exemplars with qualityScore <0.5. Add more high-quality exemplars for the intent type.
AI audit log not recording interactions	DynamoDB PutItem permission missing or table TTL misconfigured	Check CloudWatch for DynamoDB errors in StackFlowAPI logs. Verify table `stackflow-ai-audit-log` exists and TTL attribute `expiresAt` is enabled.	Add `dynamodb:PutItem` on `stackflow-ai-audit-log` to StackFlowAPIRole. Create table if missing.
OpenSearch Serverless collection returning 403	Data access policy missing the IAM role	Check OpenSearch Serverless data access policies in AWS Console. Verify StackFlowBedrockKBRole and StackFlowAPIRole are listed.	Update data access policy via `aws opensearchserverless update-access-policy` to include both roles as principals. Confirm each role's IAM policy also grants `aoss:APIAccessAll`.
Ingestion job stuck / never completing	S3 access denied or OpenSearch OCU capacity limit	Check ingestion job status and `failedDocumentCount`. Review CloudWatch Logs for Bedrock ingestion errors. Check OpenSearch OCU usage.	Fix S3 bucket policy for StackFlowBedrockKBRole. Increase OpenSearch Serverless OCU limit in console. Remove malformed documents from S3.
AI response latency > 8 seconds	Cold start, Neptune slow query, or Bedrock KB high latency	Enable X-Ray on StackFlowAPI. Check Neptune `GremlinRequestsPerSec` and `CPUUtilization`. Check Bedrock `InvocationLatency` metric.	Enable Provisioned Concurrency on StackFlowAPI alias. Optimize Gremlin queries with .limit() and .dedup(). Reduce KB numberOfResults from 10 to 5.
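
The cache hit rate row above recommends normalizing query text before hashing. A minimal Python sketch of that rule follows, assuming the cache key is the SHA256 hex digest of the normalized text. The `rag:` key prefix and the `strip_trailing_punct` option are hypothetical and are not taken from StackFlowAPI.

```python
import hashlib
import re


def normalize_query(query: str, strip_trailing_punct: bool = False) -> str:
    """Lowercase, trim, and collapse whitespace runs to a single space."""
    text = re.sub(r"\s+", " ", query.lower().strip())
    if strip_trailing_punct:
        # Hypothetical extension: also drop trailing ?, !, and . characters.
        text = text.rstrip("?!. ")
    return text


def cache_key(query: str, prefix: str = "rag:", **kwargs) -> str:
    """Build a cache key from the SHA256 digest of the normalized query."""
    normalized = normalize_query(query, **kwargs)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{prefix}{digest}"
```

Note that the normalization rule in the table does not touch punctuation, so "reset password?" and "reset password" still hash to different keys unless something like the optional flag is used. Logging `normalize_query(q)` next to the key shows which variations still produce distinct keys.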

Diagnostic CLI Commands

# Check Neptune cluster status
aws neptune describe-db-clusters   --db-cluster-identifier stackflow-knowledge-graph   --query 'DBClusters[0].{Status:Status,Engine:EngineVersion,MultiAZ:MultiAZ}'   --region us-east-1

# Check Bedrock KB status
aws bedrock-agent get-knowledge-base   --knowledge-base-id BXJGG7PIPS   --query 'knowledgeBase.{Status:status,Updated:updatedAt,Name:name}'   --region us-east-1

# List ingestion jobs (last 5)
aws bedrock-agent list-ingestion-jobs   --knowledge-base-id BXJGG7PIPS   --data-source-id $(aws bedrock-agent list-data-sources --knowledge-base-id BXJGG7PIPS --query 'dataSourceSummaries[0].dataSourceId' --output text)   --query 'ingestionJobSummaries[0:5].{Status:status,Started:startedAt,Stats:statistics}'   --region us-east-1

# Check CacheWarmer Lambda last run
aws logs filter-log-events   --log-group-name /aws/lambda/StackFlowCacheWarmer   --start-time $(date -d '24 hours ago' +%s000)   --filter-pattern '"warmed"'   --query 'events[*].message' --output text   --region us-east-1

# Test StackFlowAPI health (includes Redis and Aurora checks)
aws lambda invoke   --function-name StackFlowAPI   --cli-binary-format raw-in-base64-out   --payload '{"path":"/prod/api/health","httpMethod":"GET","headers":{}}'   --region us-east-1   /tmp/health.json && cat /tmp/health.json

# Check Neptune CPU utilization (last 1 hour)
aws cloudwatch get-metric-statistics   --namespace AWS/Neptune   --metric-name CPUUtilization   --dimensions Name=DBClusterIdentifier,Value=stackflow-knowledge-graph   --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)   --end-time $(date -u +%Y-%m-%dT%H:%M:%S)   --period 300 --statistics Average   --region us-east-1

# Check Redis cache hit ratio in ElastiCache metrics
aws cloudwatch get-metric-statistics   --namespace AWS/ElastiCache   --metric-name CacheHitRate   --dimensions Name=CacheClusterId,Value=stackflow-redis-prod-0001-001   --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)   --end-time $(date -u +%Y-%m-%dT%H:%M:%S)   --period 300 --statistics Average   --region us-east-1

# Check AI audit log for recent errors
# Note: --limit caps the items scanned before the filter is applied, so fewer than 20 matches may be returned
aws dynamodb scan   --table-name stackflow-ai-audit-log   --filter-expression 'attribute_exists(#e)'   --expression-attribute-names '{"#e":"error","#q":"query"}'   --projection-expression 'tenantId,userId,#q,#e,latencyMs'   --limit 20   --region us-east-1

# Verify OpenSearch Serverless collection is active
aws opensearchserverless batch-get-collection   --ids q3oso7unldm9p4xsqez4   --query 'collectionDetails[0].{Status:status,Endpoint:collectionEndpoint}'   --region us-east-1

# Check Neptune graph vertex count (requires VPC access)
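# curl -s -X POST https://stackflow-knowledge-graph.cluster-c6pq0smgmlri.us-east-1.neptune.amazonaws.com:8182/gremlin \
#   -H "Content-Type: application/json" \
#   -d '{"gremlin":"g.V().count()"}' | python3 -m json.tool
# Note: if IAM auth is enabled on the cluster, this unsigned request is rejected; use the SigV4-signed script under Neptune Connectivity Test instead.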

Log Queries (CloudWatch Insights)

# AI audit log: find all failed queries in last 24 hours
fields @timestamp, query, error, latencyMs, tenantId
| filter ispresent(error)
| sort @timestamp desc
| limit 50

# Lambda: RAG-related errors only
fields @timestamp, @message
| filter @message like /Neptune|Bedrock|Redis|RAG|graph-rag/
| filter @message like /ERROR|WARN|timeout|ECONNREFUSED/
| sort @timestamp desc
| limit 100

# Lambda: slow AI responses (> 5 seconds)
fields @timestamp, @message
| filter @message like /latencyMs/
| parse @message '"latencyMs":*,' as latencyMs
| filter latencyMs > 5000
| sort latencyMs desc
| limit 20

# Neptune seeder: check sync results
fields @timestamp, @message
| filter @message like /vertices_upserted|edges_upserted|ERROR/
| sort @timestamp desc
| limit 30
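
The queries above can also be run outside the console. The following is a rough sketch, assuming a boto3 CloudWatch Logs client is passed in; the function name `run_insights_query` and its parameters are illustrative and not part of StackFlow. It starts a query over a trailing time window, polls until the query finishes, and flattens each result row into a dict.

```python
import time


def run_insights_query(logs_client, log_groups, query, hours=24,
                       poll_seconds=1.0, timeout_seconds=60,
                       now=time.time, sleep=time.sleep):
    """Run a Logs Insights query over the last `hours` and return rows as dicts."""
    end = int(now())
    start = end - hours * 3600  # start_query takes epoch seconds
    query_id = logs_client.start_query(
        logGroupNames=list(log_groups),
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]

    deadline = now() + timeout_seconds
    while True:
        resp = logs_client.get_query_results(queryId=query_id)
        status = resp["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in resp["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"Insights query {query_id} ended with status {status}")
        if now() >= deadline:
            raise TimeoutError(f"Insights query {query_id} still {status} after {timeout_seconds}s")
        sleep(poll_seconds)
```

A call might look like `run_insights_query(boto3.client('logs', region_name='us-east-1'), ['/aws/lambda/StackFlowAPI'], 'fields @timestamp, @message | limit 20')`. The `now` and `sleep` parameters exist only so the polling loop can be exercised without real waiting.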

Neptune Connectivity Test

#!/usr/bin/env python3
"""
Neptune connectivity test -- run from a Lambda or EC2 instance in the VPC.
Requires: boto3, requests, aws-requests-auth
pip install boto3 requests aws-requests-auth
"""
import boto3
import requests
from aws_requests_auth.boto_utils import BotoAWSRequestsAuth

NEPTUNE_ENDPOINT = 'stackflow-knowledge-graph.cluster-c6pq0smgmlri.us-east-1.neptune.amazonaws.com'
REGION = 'us-east-1'

auth = BotoAWSRequestsAuth(
    aws_host=f'{NEPTUNE_ENDPOINT}:8182',
    aws_region=REGION,
    aws_service='neptune-db'
)

# Test 1: Count total vertices
url = f'https://{NEPTUNE_ENDPOINT}:8182/gremlin'
response = requests.post(url, json={'gremlin': 'g.V().count()'}, auth=auth, timeout=30)
print(f'Status: {response.status_code}')
result = response.json()
count = result['result']['data']['@value'][0]['@value']
print(f'Total vertices in graph: {count}')

# Test 2: Count vertices per label
response2 = requests.post(url, json={'gremlin': 'g.V().label().groupCount()'}, auth=auth, timeout=30)
response2.raise_for_status()
# GraphSON 3 encodes the map as a flat [key, value, key, value, ...] list
labels = response2.json()['result']['data']['@value'][0]['@value']
print('Vertex labels:', labels)
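
The raw response printed by the script is still wrapped in GraphSON type markers, which makes the label counts hard to read. The helper below is a small sketch, assuming the endpoint returns GraphSON 3 (typed `@type`/`@value` wrappers, with `g:Map` serialized as a flat key/value list). The name `unwrap_graphson` is illustrative, and the sample payload is made up for demonstration.

```python
def unwrap_graphson(value):
    """Recursively strip GraphSON 3 type wrappers into plain Python values."""
    if isinstance(value, dict) and "@type" in value and "@value" in value:
        gtype, inner = value["@type"], value["@value"]
        if gtype == "g:Map":
            # Flat [k1, v1, k2, v2, ...] list; keys must be hashable after unwrapping.
            items = [unwrap_graphson(v) for v in inner]
            return dict(zip(items[0::2], items[1::2]))
        if gtype in ("g:List", "g:Set"):
            return [unwrap_graphson(v) for v in inner]
        # Scalars such as g:Int64, or nested structures such as g:Vertex.
        return unwrap_graphson(inner)
    if isinstance(value, list):
        return [unwrap_graphson(v) for v in value]
    if isinstance(value, dict):
        return {k: unwrap_graphson(v) for k, v in value.items()}
    return value


# Made-up sample shaped like a g.V().label().groupCount() response body.
sample = {"result": {"data": {"@type": "g:List", "@value": [
    {"@type": "g:Map", "@value": [
        "server", {"@type": "g:Int64", "@value": 12},
        "application", {"@type": "g:Int64", "@value": 5},
    ]},
]}}}
label_counts = unwrap_graphson(sample["result"]["data"])[0]
```

With the script above, `unwrap_graphson(response2.json()['result']['data'])[0]` would yield a plain label-to-count dict, provided the map keys are simple values such as label strings.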

Bedrock KB Smoke Test

#!/usr/bin/env python3
"""Quick smoke test for Bedrock KB BXJGG7PIPS"""
import boto3

bedrock = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

def smoke_test():
    test_queries = [
        "How do I resolve high Aurora connection count?",
        "What is the Neptune cluster endpoint?",
        "How do I reset a Cognito user password?"
    ]
    for q in test_queries:
        print(f"\nQuery: {q}")
        try:
            response = bedrock.retrieve(
                knowledgeBaseId='BXJGG7PIPS',
                retrievalQuery={'text': q},
                retrievalConfiguration={
                    'vectorSearchConfiguration': {'numberOfResults': 3}
                }
            )
            results = response['retrievalResults']
            print(f"  Results: {len(results)} passages")
            if results:
                top = results[0]
                print(f"  Top score: {top['score']:.3f}")
                print(f"  Source: {top['location'].get('s3Location', {}).get('uri', 'N/A')}")
                print(f"  Preview: {top['content']['text'][:100]}...")
            else:
                print("  WARNING: No results returned -- KB may be empty")
        except Exception as e:
            print(f"  ERROR: {e}")

smoke_test()
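
The throttling row under Common Issues recommends exponential backoff with 5 retries. One way to wrap a call such as `bedrock.retrieve(...)` is sketched below. It assumes throttling surfaces as a botocore `ClientError` whose `response['Error']['Code']` is `ThrottlingException`. The helper names, delay values, and the use of full jitter are illustrative choices, not StackFlow's actual implementation.

```python
import random
import time

# Assumption: the error code Bedrock uses for HTTP 429 throttling responses.
THROTTLE_CODES = {"ThrottlingException"}


def is_throttle(exc):
    """True if exc carries a botocore-style error response with a throttling code."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call fn(); on throttling, retry up to max_retries times with jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the final failed attempt.
            if not is_throttle(exc) or attempt == max_retries:
                raise
            # Exponential ceiling (0.5s, 1s, 2s, ...) capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the ceiling.
            sleep(random.uniform(0, ceiling))
```

In the smoke test this could be used as `call_with_backoff(lambda: bedrock.retrieve(knowledgeBaseId='BXJGG7PIPS', retrievalQuery={'text': q}))`. Jitter spreads retries from concurrent Lambda invocations so they do not hit the quota at the same instant.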

Implementation examples