AI & Bedrock Errors

Error Taxonomy

AI and Bedrock errors in StackFlow fall into three categories: model invocation errors (API-level failures from Bedrock), retrieval errors (RAG pipeline failures when fetching context), and generation quality issues (the model returns a response, but the response is poor). Only the first two categories are system errors; quality issues call for prompt template tuning instead.

⚙️ Minimum Requirements

CloudWatch Logs: AI errors visible in /aws/lambda/StackFlowAPI with filter ERROR.*(Bedrock|Neptune|Redis)
DynamoDB: stackflow-ai-audit-log table accessible for querying failed AI interactions
Bedrock: Service quotas visible in AWS console; check model invocation limits for anthropic.claude-3-5-sonnet-20241022-v2:0
CloudWatch Alarms: BedrockThrottling alarm active on metric StackFlow/AI:ModelThrottleCount
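
As a quick local check of the log filter, the sketch below applies the same regular expression to log lines piped in on stdin. The function name `filter_ai_errors` is hypothetical, and the alternation is grouped in parentheses so that `ERROR` applies to all three service names.

```shell
# Keep only ERROR lines that mention one of the AI-path backends.
# Without the parentheses, the pattern would also match any line that
# merely contains "Neptune" or "Redis", including INFO lines.
filter_ai_errors() {
  grep -E 'ERROR.*(Bedrock|Neptune|Redis)'
}

# Hypothetical usage against live logs (needs AWS CLI v2 and credentials):
#   aws logs tail /aws/lambda/StackFlowAPI --since 1h | filter_ai_errors
```

The function only reads stdin, so it can be tested against a saved log export before being pointed at production logs.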

Model Errors

Symptom	Likely Cause	Diagnostic Step	Resolution
ValidationException from Bedrock	Input exceeds model context window	Log input token count before API call	Reduce context size — limit exemplars to 2, truncate description
AccessDeniedException from Bedrock	Lambda role not granted model access	Check Lambda IAM role for bedrock:InvokeModel permission	Add bedrock:InvokeModel for specific model ARN to Lambda role
ModelNotReadyException	Model is being updated or reloaded by AWS	Wait 60s, check AWS Service Health dashboard	Retry request; configure automatic fallback to alternative model
ResourceNotFoundException for KB	KB ID misconfigured or KB deleted	Verify KB BXJGG7PIPS exists: `aws bedrock-agent get-knowledge-base --knowledge-base-id BXJGG7PIPS`	Correct KB ID in system properties
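
For the ValidationException row, a rough pre-flight size check can be sketched as follows. It assumes roughly 4 bytes per token, which is a heuristic and not the model's real tokenizer; the function names and the budget value are hypothetical.

```shell
# Rough token estimate: about 4 bytes per token (heuristic only).
estimate_tokens() {
  local bytes
  bytes=$(printf '%s' "$1" | wc -c | tr -d ' ')
  echo $(( (bytes + 3) / 4 ))   # round up
}

# Succeeds when the text fits within the given token budget.
within_token_budget() {
  [ "$(estimate_tokens "$1")" -le "$2" ]
}

# Cut the text down to at most MAX_TOKENS under the same heuristic.
truncate_to_tokens() {
  printf '%s' "$1" | head -c "$(( $2 * 4 ))"
}

# Hypothetical usage: log the estimate, then truncate if over budget.
#   echo "input_tokens_estimate=$(estimate_tokens "$description")" >&2
#   within_token_budget "$description" 2000 || description=$(truncate_to_tokens "$description" 2000)
```

Because the estimate is approximate, the budget should sit comfortably below the model's actual context limit.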

Throttling & Quotas

Bedrock enforces model-specific invocation quotas, expressed as requests per minute (RPM) and tokens per minute (TPM). A ThrottlingException indicates that one of these limits has been exceeded. Check the AI Observability dashboard for request rate trends, and review the model router's routing rules to ensure expensive models are reserved for complex tasks.

aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name InvocationThrottles \
  --dimensions Name=ModelId,Value=anthropic.claude-3-sonnet-20240229-v1:0 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Sum \
  --region us-east-1
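
To reduce the command's output to a single number, one option is to append `--query 'Datapoints[].Sum' --output text` to it and total the values. The helper below is a minimal sketch of that last step; `sum_datapoints` is a hypothetical name, and it assumes whitespace-separated numeric values on stdin.

```shell
# Add up whitespace-separated numbers read from stdin and print the total.
# Non-numeric tokens (for example "None") count as 0 in awk arithmetic,
# and empty input yields 0.
sum_datapoints() {
  awk '{ for (i = 1; i <= NF; i++) total += $i } END { printf "%d\n", total }'
}

# Hypothetical usage:
#   aws cloudwatch get-metric-statistics ... \
#     --query 'Datapoints[].Sum' --output text | sum_datapoints
```

A non-zero total for the last hour means throttling occurred during that window, even if the alarm has not fired.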

Quota Increases: Request Bedrock quota increases via the AWS Service Quotas console. Specify your expected peak RPM and the business use case. AWS typically responds within 2-3 business days for Bedrock quota increases.
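
Until a quota increase is approved, callers can absorb occasional throttling by retrying with exponential backoff. The wrapper below is a generic sketch, not StackFlow's actual retry logic: `retry_with_backoff` is a hypothetical name, and it retries on any non-zero exit status rather than inspecting the error type.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after each failed attempt.
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Hypothetical usage (invoke_model.sh stands in for the real call):
#   retry_with_backoff 5 2 ./invoke_model.sh
```

The AWS CLI and SDKs also retry throttled calls on their own; in the CLI this can be tuned with the `AWS_RETRY_MODE` and `AWS_MAX_ATTEMPTS` environment variables, so an outer wrapper like this is mainly useful for scripts that need longer waits.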

RAG Failures

Symptom	Likely Cause	Diagnostic Step	Resolution
AI returns "I don't have information" for known articles	KB not synced recently	Check last ingestion job status in Bedrock console	Trigger manual sync via `bedrock-agent start-ingestion-job`
KB returns irrelevant results	Similarity threshold too low	Test retrieval directly using `bedrock-agent-runtime retrieve` API	Increase minimum similarity threshold in RAG config
New articles not appearing in search	S3 sync not triggering Bedrock ingestion	Check S3 event notification on knowledge base bucket	Verify S3 event notification is configured to trigger KB sync
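
The diagnostic and resolution steps in the table can be wrapped in small helpers, sketched below. The function names are hypothetical, and the data source ID is a placeholder that does not appear in this document; look it up in the Bedrock console. Setting `AWS_CLI=echo` prints the commands instead of running them.

```shell
AWS_CLI="${AWS_CLI:-aws}"   # override with AWS_CLI=echo for a dry run

# List ingestion jobs for one data source of a knowledge base.
kb_list_ingestion_jobs() {
  "$AWS_CLI" bedrock-agent list-ingestion-jobs \
    --knowledge-base-id "$1" --data-source-id "$2"
}

# Start a manual sync (ingestion job) for the same data source.
kb_start_sync() {
  "$AWS_CLI" bedrock-agent start-ingestion-job \
    --knowledge-base-id "$1" --data-source-id "$2"
}

# Run a retrieval query directly against the knowledge base.
# Assumes the query text contains no double quotes.
kb_retrieve() {
  "$AWS_CLI" bedrock-agent-runtime retrieve \
    --knowledge-base-id "$1" \
    --retrieval-query "{\"text\": \"$2\"}"
}

# Hypothetical usage (DATA_SOURCE_ID is a placeholder):
#   kb_list_ingestion_jobs BXJGG7PIPS DATA_SOURCE_ID
#   kb_start_sync BXJGG7PIPS DATA_SOURCE_ID
#   kb_retrieve BXJGG7PIPS "how do I reset a password"
```

Comparing the relevance scores returned by the retrieve call against the configured minimum similarity threshold shows whether irrelevant chunks are slipping through.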

Semantic Cache Issues

Count the AI cache entries for a tenant (replace TENANT_ID with the tenant's identifier):

redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
  -p 6379 -a "$REDIS_AUTH_TOKEN" --tls \
  --scan --pattern "t:TENANT_ID:ai:cache:*" | wc -l

Read the instance-wide hit and miss counters:

redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com \
  -p 6379 -a "$REDIS_AUTH_TOKEN" --tls \
  INFO stats | grep -E 'keyspace_(hits|misses)'

If the cache hit rate is unexpectedly low (<20%), check whether the similarity threshold is set too high (above 0.95), whether TTLs are too short, or whether Redis memory pressure is causing premature eviction. Increase Redis node size or reduce TTLs for less critical cache entries to free memory for AI response caching.
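
The hit rate referred to above is hits / (hits + misses). A minimal sketch for computing it from `INFO stats` output follows; `cache_hit_rate` is a hypothetical name. Note that these counters cover the whole Redis instance, not only AI cache keys, and are cumulative since the last restart or stats reset.

```shell
# Reads `redis-cli INFO stats` output on stdin and prints the hit rate
# as a percentage with one decimal place, or "n/a" if there were no lookups.
cache_hit_rate() {
  # redis-cli output uses CRLF line endings; strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hits / total
    }'
}

# Hypothetical usage:
#   redis-cli ... INFO stats | cache_hit_rate
```

The same `INFO stats` output includes an `evicted_keys` counter; a value that keeps climbing points to memory pressure rather than threshold or TTL settings.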
