AI & Bedrock Errors
Error Taxonomy
AI and Bedrock errors in StackFlow fall into three categories: model invocation errors (API-level failures from Bedrock), retrieval errors (RAG pipeline failures when fetching context), and generation quality issues (the model returns a response, but the response is poor). Only the first two categories are system errors; quality issues require prompt template tuning.
- CloudWatch Logs: AI errors visible in `/aws/lambda/StackFlowAPI` with filter `ERROR.*Bedrock|Neptune|Redis`
- DynamoDB: `stackflow-ai-audit-log` table accessible for querying failed AI interactions
- Bedrock: Service quotas visible in AWS console; check model invocation limits for `anthropic.claude-3-5-sonnet-20241022-v2:0`
- CloudWatch Alarms: `BedrockThrottling` alarm active on metric `StackFlow/AI:ModelThrottleCount`
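As a rough illustration of the log filter in the list above, the sketch below applies it to log lines read from standard input. The function name `filter_ai_errors` is hypothetical. Note one assumption: the pattern as written (`ERROR.*Bedrock|Neptune|Redis`), read as an extended regular expression, matches any line containing `Neptune` or `Redis` even without `ERROR`, so this sketch groups the alternatives on the guess that only error lines are wanted.

```shell
# Hypothetical helper: keep only ERROR lines that mention an AI dependency.
# Assumes the intended grouping is ERROR.*(Bedrock|Neptune|Redis).
filter_ai_errors() {
  grep -E 'ERROR.*(Bedrock|Neptune|Redis)'
}

# Possible usage (assumes AWS CLI v2, which provides `aws logs tail`):
#   aws logs tail /aws/lambda/StackFlowAPI --since 1h | filter_ai_errors
```

Piping through a local filter keeps the regular expression semantics explicit; CloudWatch Logs filter patterns use their own syntax, so the same string may behave differently if pasted into the console.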
Model Errors
| Symptom | Likely Cause | Diagnostic Step | Resolution |
|---|---|---|---|
| ValidationException from Bedrock | Input exceeds model context window | Log input token count before API call | Reduce context size: limit exemplars to 2 and truncate the description |
| AccessDeniedException from Bedrock | Lambda role not granted model access | Check Lambda IAM role for bedrock:InvokeModel permission | Add bedrock:InvokeModel for specific model ARN to Lambda role |
| ModelNotReadyException | Model is being updated or reloaded by AWS | Wait 60s, check AWS Service Health dashboard | Retry request; configure automatic fallback to alternative model |
| ResourceNotFoundException for KB | KB ID misconfigured or KB deleted | Verify KB BXJGG7PIPS exists: aws bedrock-agent get-knowledge-base --knowledge-base-id BXJGG7PIPS | Correct KB ID in system properties |
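The diagnostic step for `ValidationException` (logging the input token count before the API call) can be approximated from the shell when checking a saved prompt. This is a minimal sketch, not the model's tokenizer: it assumes roughly 4 characters per token, a common rule of thumb for English text, and the function names and the budget value are hypothetical.

```shell
# Rough estimate only: ~4 characters per token (rule of thumb, NOT the real tokenizer).
# Reads the prompt text from stdin and prints the estimated token count.
estimate_tokens() {
  est_chars=$(wc -c)
  echo $(( (est_chars + 3) / 4 ))
}

# Hypothetical guard: logs the estimate to stderr and succeeds only if the
# prompt on stdin fits within the given token budget.
fits_budget() {
  fb_budget=$1
  fb_tokens=$(estimate_tokens)
  echo "estimated_input_tokens=$fb_tokens budget=$fb_budget" >&2
  [ "$fb_tokens" -le "$fb_budget" ]
}

# Possible usage (prompt.txt and the budget are placeholders):
#   fits_budget 150000 < prompt.txt || echo "reduce exemplars or truncate description"
```

Because the estimate is approximate, leaving headroom below the real context window is safer than comparing against the exact limit.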
Throttling & Quotas
Bedrock has model-specific invocation rate limits (Requests Per Minute, RPM). ThrottlingException indicates the RPM limit has been exceeded. Check the AI Observability dashboard for request rate trends, and review the model router's routing rules to ensure expensive models are reserved for complex tasks.
aws cloudwatch get-metric-statistics --namespace AWS/Bedrock --metric-name InvocationThrottles --dimensions Name=ModelId,Value=anthropic.claude-3-sonnet-20240229-v1:0 --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) --period 300 --statistics Sum --region us-east-1
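When `ThrottlingException` is transient rather than a sustained quota problem, callers commonly retry with exponential backoff. The sketch below shows that general technique for any shell command; it is not taken from StackFlow's code, and the function name and arguments are hypothetical.

```shell
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_with_backoff() {
  rb_max=$1
  rb_delay=$2
  shift 2
  rb_attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds.
    "$@" && return 0
    if [ "$rb_attempt" -ge "$rb_max" ]; then
      echo "giving up after $rb_attempt attempts" >&2
      return 1
    fi
    sleep "$rb_delay"
    rb_delay=$((rb_delay * 2))      # double the wait before the next attempt
    rb_attempt=$((rb_attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 2 some_command` would wait 2, 4, 8, and 16 seconds between its five attempts. Backoff only smooths short bursts; if throttling persists, the request rate or the quota itself still needs attention.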
RAG Failures
| Symptom | Likely Cause | Diagnostic Step | Resolution |
|---|---|---|---|
| AI returns "I don't have information" for known articles | KB not synced recently | Check last ingestion job status in Bedrock console | Trigger manual sync via bedrock-agent start-ingestion-job |
| KB returns irrelevant results | Similarity threshold too low | Test retrieval directly using bedrock-agent-runtime retrieve API | Increase minimum similarity threshold in RAG config |
| New articles not appearing in search | S3 sync not triggering Bedrock ingestion | Check S3 event notification on knowledge base bucket | Verify S3 event notification is configured to trigger KB sync |
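To judge whether the similarity threshold is the cause of irrelevant results, it can help to look at the scores the knowledge base actually returns and see which results would survive a higher cutoff. The sketch below filters tab-separated `score<TAB>source` lines by a minimum score. The function name and the example threshold are hypothetical, and the `aws` invocation in the comment is an assumption about how such lines could be produced.

```shell
# Hypothetical helper: keep only results whose score (first tab-separated
# column) is at least the given minimum.
# Usage: filter_by_score MIN_SCORE < results.tsv
filter_by_score() {
  awk -F '\t' -v min="$1" '($1 + 0) >= (min + 0)'
}

# Possible way to produce the input (query text and threshold are placeholders):
#   aws bedrock-agent-runtime retrieve \
#     --knowledge-base-id BXJGG7PIPS \
#     --retrieval-query '{"text": "example question"}' \
#     --query 'retrievalResults[].[score, location.s3Location.uri]' \
#     --output text | filter_by_score 0.5
```

Running the same query at a few thresholds shows how many relevant results would be lost before the change is made in the RAG config.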
Semantic Cache Issues
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com -p 6379 -a "$REDIS_AUTH_TOKEN" --tls --scan --pattern "t:TENANT_ID:ai:cache:*" | wc -l
redis-cli -h master.stackflow-redis-prod.mnzfvx.use1.cache.amazonaws.com -p 6379 -a "$REDIS_AUTH_TOKEN" --tls INFO stats | grep keyspace_hits
redis-cli ... INFO stats | grep keyspace_misses
If the cache hit rate is unexpectedly low (<20%), check whether the similarity threshold is set too high (above 0.95), whether TTLs on AI cache entries are too short, or whether Redis memory pressure is causing premature eviction. If memory pressure is the cause, increase the Redis node size or shorten TTLs for less critical cache entries to free memory for AI response caching.
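The hit rate referred to above can be computed from the two counters as hits / (hits + misses). The sketch below parses `INFO stats` output from standard input; the function name is hypothetical. One caveat: `keyspace_hits` and `keyspace_misses` are instance-wide counters, so the result reflects all key lookups on the node and is only an approximation of the AI semantic cache's hit rate.

```shell
# Hypothetical helper: print the hit rate (percent, one decimal) from
# `redis-cli INFO stats` output supplied on stdin.
cache_hit_rate() {
  # INFO output uses CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * hits / total
    }'
}

# Possible usage (connection flags elided, as in the commands above):
#   redis-cli ... INFO stats | cache_hit_rate
```

Since the counters accumulate from server start, comparing two readings taken some minutes apart gives a better picture of the current hit rate than a single reading.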