Problem Management

Problem vs Incident

In ITIL terminology, an Incident is a disruption to normal service that needs immediate resolution. A Problem is the underlying cause of one or more incidents. StackFlow maintains separate lifecycle workflows for each, with bidirectional linking that allows you to track which incidents were caused by a known problem.

⚙️ Minimum Requirements

DynamoDB: StackFlow_Problem table and StackFlow_KnownError table (KEDB) provisioned
IAM: StackFlowAPIRole with dynamodb:PutItem, dynamodb:UpdateItem on both tables
Aurora: stackflow.problem_incident_links join table migrated
Bedrock KB: BXJGG7PIPS active for AI-assisted root cause analysis suggestions

Problems are typically created reactively (from recurring incidents) or proactively (through trend analysis). The AI engine can automatically suggest creating a Problem record when it detects three or more incidents with similar root cause indicators within a 72-hour window.
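The suggestion rule above (three or more similar incidents within 72 hours) can be illustrated with a sliding-window check. This is a minimal sketch, not StackFlow's actual detection logic: it assumes each incident already carries a single `indicator` label and an `opened_at` timestamp, both hypothetical field names, and it treats "similar" as an exact label match, whereas the AI engine presumably uses a softer similarity measure.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=72)   # window size taken from the text above
MIN_INCIDENTS = 3              # threshold taken from the text above


def suggest_problem_candidates(incidents):
    """Return {indicator: [incident ids]} for every indicator that has at
    least MIN_INCIDENTS incidents opened within one WINDOW-long span.

    Each incident is a dict with hypothetical keys: id, indicator, opened_at.
    """
    by_indicator = {}
    for incident in incidents:
        by_indicator.setdefault(incident["indicator"], []).append(incident)

    suggestions = {}
    for indicator, group in by_indicator.items():
        group.sort(key=lambda inc: inc["opened_at"])
        start = 0
        for end in range(len(group)):
            # Advance the left edge until the span fits inside the window.
            while group[end]["opened_at"] - group[start]["opened_at"] > WINDOW:
                start += 1
            if end - start + 1 >= MIN_INCIDENTS:
                suggestions[indicator] = [inc["id"] for inc in group[start:end + 1]]
                break  # the first qualifying cluster is enough to suggest a Problem
    return suggestions
```

Sorting each group once and moving the left edge forward keeps the check linear per indicator after the sort.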

Problem Lifecycle

State	Description
New	Problem identified, root cause unknown
Under Investigation	RCA in progress
Known Error	Root cause identified, workaround or fix documented in KEDB
Fix in Progress	Permanent fix being developed (linked to a Change)
Resolved	Permanent fix deployed, no further incidents expected
Closed	Confirmed resolved after monitoring period
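The lifecycle table can be read as a small state machine. The sketch below assumes that a Problem advances strictly in the order listed, one state at a time; the source does not say whether states can be skipped or reopened, so the `ALLOWED` map is an illustrative assumption rather than StackFlow's enforced rules.

```python
from enum import Enum


class ProblemState(Enum):
    NEW = "New"
    UNDER_INVESTIGATION = "Under Investigation"
    KNOWN_ERROR = "Known Error"
    FIX_IN_PROGRESS = "Fix in Progress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumption: each state may only move to the next one in table order.
ALLOWED = {
    ProblemState.NEW: {ProblemState.UNDER_INVESTIGATION},
    ProblemState.UNDER_INVESTIGATION: {ProblemState.KNOWN_ERROR},
    ProblemState.KNOWN_ERROR: {ProblemState.FIX_IN_PROGRESS},
    ProblemState.FIX_IN_PROGRESS: {ProblemState.RESOLVED},
    ProblemState.RESOLVED: {ProblemState.CLOSED},
    ProblemState.CLOSED: set(),
}


def transition(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Cannot move from {current.value} to {target.value}")
    return target
```

If your process allows reopening a Resolved problem or skipping Known Error, extend the relevant sets in `ALLOWED` accordingly.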

Root Cause Analysis

StackFlow provides a structured RCA workspace accessible from the Problem record. The RCA workspace includes a 5-Whys analysis tool, fishbone (Ishikawa) diagram builder, and a timeline view of all related incidents. The AI Copilot can assist with RCA by analyzing incident descriptions and suggesting potential root causes based on the knowledge base.

curl -X GET \
  "https://your-instance.stackflow-tech.com/prod/api/problems/PRB0000123/rca" \
  -H "Authorization: Bearer $TOKEN"

Neptune Integration: The RCA timeline uses the Neptune knowledge graph to display the dependency chain of affected CIs, helping identify whether the problem originated in infrastructure, application, or external dependencies.
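To make the idea of a dependency chain concrete, the sketch below walks a small in-memory graph breadth-first, starting from an affected CI. It does not query Neptune; the `depends_on` mapping and the CI names are hypothetical stand-ins for whatever the knowledge graph actually returns.

```python
from collections import deque


def dependency_chain(depends_on, start_ci):
    """List every CI that start_ci depends on, nearest dependencies first.

    depends_on maps a CI name to the CIs it directly depends on.
    The seen set guards against cycles in the graph.
    """
    seen = {start_ci}
    chain = []
    queue = deque([start_ci])
    while queue:
        ci = queue.popleft()
        for dependency in depends_on.get(ci, []):
            if dependency not in seen:
                seen.add(dependency)
                chain.append(dependency)
                queue.append(dependency)
    return chain
```

For example, with `{"checkout-api": ["orders-db", "payments-gateway"], "orders-db": ["aurora-cluster-1"]}`, starting at `checkout-api` yields the application's direct dependencies first and the underlying infrastructure last, which is the ordering an investigator would scan when deciding where the problem originated.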

Known Error Database

The Known Error Database (KEDB) stores documented problems together with their workarounds. When a newly created incident matches a known error, the assigned agent is immediately shown the workaround steps, which reduces mean time to resolution (MTTR). KEDB entries are surfaced in the AI triage results and in the AI Copilot sidebar.

KEDB Field	Description
Error Code	Unique identifier (e.g., KE-DB-001)
Summary	Brief description of the known error
Symptoms	Observable symptoms for matching
Workaround	Steps to restore service without a permanent fix
Fix	Permanent resolution steps (if available)
Linked Problem	Parent PRB record
Linked Change	CHG record implementing the fix (if in progress)
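The fields above map naturally onto a record type, and the Symptoms field suggests how matching might work. The sketch below is a deliberately naive illustration: it counts case-insensitive substring hits between an incident description and each entry's symptoms. The real matching is done by the AI triage and is not described in this document, so the `KnownError` class and `match_known_errors` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KnownError:
    error_code: str                       # e.g. "KE-DB-001"
    summary: str
    symptoms: List[str]
    workaround: str
    fix: Optional[str] = None             # permanent fix, if available
    linked_problem: Optional[str] = None  # parent PRB record
    linked_change: Optional[str] = None   # CHG record, if a fix is in progress


def match_known_errors(incident_text, kedb):
    """Return KEDB entries whose symptoms appear in the incident text,
    ordered by number of matching symptoms (most matches first)."""
    text = incident_text.lower()
    scored = []
    for entry in kedb:
        hits = sum(1 for symptom in entry.symptoms if symptom.lower() in text)
        if hits:
            scored.append((hits, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]
```

The first entry returned is the one whose workaround an agent would most plausibly want to see first.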

Creating a Problem

curl -X POST "https://your-instance.stackflow-tech.com/prod/api/problems" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "short_description": "Aurora connection pool exhaustion under peak load",
    "related_incidents": ["INC0001234", "INC0001189", "INC0001045"],
    "category": "database",
    "assignment_group": "Platform Engineering",
    "priority": "P2"
  }'
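The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; the endpoint, headers, and payload fields mirror the curl example, while the function names and the assumption that the API replies with a JSON body are illustrative and not confirmed by this document.

```python
import json
import urllib.request


def build_create_problem_request(base_url, token, payload):
    """Build (but do not send) the POST request for a new Problem record."""
    return urllib.request.Request(
        url=f"{base_url}/prod/api/problems",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_problem(base_url, token, payload):
    """Send the request and decode the reply.

    Assumption: the API responds with a JSON document describing the record.
    """
    request = build_create_problem_request(base_url, token, payload)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes it easy to inspect or log the outgoing payload before it reaches the instance.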
