Problem Management
Problem vs Incident
In ITIL terminology, an Incident is a disruption to normal service that needs immediate resolution. A Problem is the underlying cause of one or more incidents. StackFlow maintains separate lifecycle workflows for each, with bidirectional linking that allows you to track which incidents were caused by a known problem.
- DynamoDB: `StackFlow_Problem` table and `StackFlow_KnownError` table (KEDB) provisioned
- IAM: `StackFlowAPIRole` with `dynamodb:PutItem`, `dynamodb:UpdateItem` on both tables
- Aurora: `stackflow.problem_incident_links` join table migrated
- Bedrock KB: `BXJGG7PIPS` active for AI-assisted root cause analysis suggestions
Problems are typically created reactively (from recurring incidents) or proactively (through trend analysis). The AI engine can automatically suggest creating a Problem record when it detects three or more incidents with similar root cause indicators within a 72-hour window.
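The threshold rule above (three or more similar incidents within 72 hours) can be pictured as a sliding-window check. The following is a minimal sketch, not StackFlow's actual detection logic: it assumes the incidents have already been grouped by a shared root cause indicator, and the function name `should_suggest_problem` is hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=72)  # window size stated in the text
MIN_INCIDENTS = 3             # threshold stated in the text


def should_suggest_problem(timestamps, window=WINDOW, min_incidents=MIN_INCIDENTS):
    """Return True if at least `min_incidents` timestamps fall inside one window.

    `timestamps` is assumed to hold the creation times of incidents that
    were already judged similar (the similarity step is out of scope here).
    """
    ordered = sorted(timestamps)
    start = 0
    for end, ts in enumerate(ordered):
        # Advance the left edge until the span fits inside the window.
        while ts - ordered[start] > window:
            start += 1
        if end - start + 1 >= min_incidents:
            return True
    return False
```

Sorting first keeps the check linear after the sort, since each timestamp enters and leaves the window at most once.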
Problem Lifecycle
| State | Description |
|---|---|
| New | Problem identified, root cause unknown |
| Under Investigation | RCA in progress |
| Known Error | Root cause identified, workaround or fix documented in KEDB |
| Fix in Progress | Permanent fix being developed (linked to a Change) |
| Resolved | Permanent fix deployed, no further incidents expected |
| Closed | Confirmed resolved after monitoring period |
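The lifecycle table can be modelled as a small state machine. The sketch below is illustrative only: the state names come from the table, but the transition rule (each state may only advance to the next row) is an assumption, since the permitted transitions are not specified here. `ProblemState` and `can_transition` are hypothetical names.

```python
from enum import Enum


class ProblemState(Enum):
    NEW = "New"
    UNDER_INVESTIGATION = "Under Investigation"
    KNOWN_ERROR = "Known Error"
    FIX_IN_PROGRESS = "Fix in Progress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumption: strictly linear progression in table order.
# A real workflow may allow skips or reopen paths.
_ORDER = list(ProblemState)
ALLOWED = {state: {_ORDER[i + 1]} for i, state in enumerate(_ORDER[:-1])}
ALLOWED[ProblemState.CLOSED] = set()  # terminal state in this sketch


def can_transition(current, target):
    """Return True if `target` is a permitted next state for `current`."""
    return target in ALLOWED[current]
```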
Root Cause Analysis
StackFlow provides a structured RCA workspace accessible from the Problem record. The workspace includes a 5-Whys analysis tool, a fishbone (Ishikawa) diagram builder, and a timeline view of all related incidents. The AI Copilot can assist with RCA by analyzing incident descriptions and suggesting potential root causes based on the knowledge base.
curl -X GET "https://your-instance.stackflow-tech.com/prod/api/problems/PRB0000123/rca" -H "Authorization: Bearer $TOKEN"
Known Error Database
The Known Error Database (KEDB) stores documented problems with their workarounds. When a new incident matches a known error, the agent handling it is immediately shown the workaround steps, which reduces mean time to resolution (MTTR). KEDB entries are surfaced in the AI triage results and the AI Copilot sidebar.
| KEDB Field | Description |
|---|---|
| Error Code | Unique identifier (e.g., KE-DB-001) |
| Summary | Brief description of the known error |
| Symptoms | Observable symptoms for matching |
| Workaround | Steps to restore service without a permanent fix |
| Fix | Permanent resolution steps (if available) |
| Linked Problem | Parent PRB record |
| Linked Change | CHG record implementing the fix (if in progress) |
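To make the field list concrete, here is a rough sketch of a KEDB entry as a data structure, together with a deliberately naive symptom match. The field names mirror the table, but the `KnownError` class, the `match_known_errors` function, and the substring-matching rule are all assumptions for illustration; the actual matching is performed by the AI triage and is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownError:
    error_code: str                        # e.g. "KE-DB-001"
    summary: str
    symptoms: list[str]                    # observable symptoms used for matching
    workaround: str
    fix: Optional[str] = None              # permanent fix, if available
    linked_problem: Optional[str] = None   # parent PRB record
    linked_change: Optional[str] = None    # CHG record, if a fix is in progress


def match_known_errors(description, entries):
    """Return entries with at least one symptom phrase found in the description.

    Case-insensitive substring matching is an assumed stand-in for the
    real matching logic.
    """
    text = description.lower()
    return [e for e in entries if any(s.lower() in text for s in e.symptoms)]
```

An incident described as "Too many connections to Aurora" would match an entry whose symptoms include "too many connections", and the agent could then be shown that entry's `workaround`.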
Creating a Problem
curl -X POST "https://your-instance.stackflow-tech.com/prod/api/problems" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "short_description": "Aurora connection pool exhaustion under peak load",
    "related_incidents": ["INC0001234", "INC0001189", "INC0001045"],
    "category": "database",
    "assignment_group": "Platform Engineering",
    "priority": "P2"
  }'
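The same request can be built from a script. This is a minimal sketch using only the Python standard library; the endpoint, headers, and payload fields are taken from the curl example above, while the helper name `build_create_problem_request` is hypothetical. The function only constructs the request, so it can be inspected before anything is sent.

```python
import json
import urllib.request

# Placeholder host from the curl example; replace with your instance.
BASE_URL = "https://your-instance.stackflow-tech.com/prod/api"


def build_create_problem_request(token, payload):
    """Build (but do not send) a POST request that creates a Problem record."""
    return urllib.request.Request(
        f"{BASE_URL}/problems",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "short_description": "Aurora connection pool exhaustion under peak load",
    "related_incidents": ["INC0001234", "INC0001189", "INC0001045"],
    "category": "database",
    "assignment_group": "Platform Engineering",
    "priority": "P2",
}
```

Passing the result to `urllib.request.urlopen(...)` would send it. The response format is not documented in this section, so no response handling is shown.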