Incident Management
Incident Lifecycle
StackFlow implements the ITIL v4 incident management process, tracking each incident through a defined lifecycle from initial detection to formal closure. The lifecycle states map directly to SLA pause/resume rules and notification triggers.
Prerequisites:
- IAM Role: StackFlowAPIRole must have dynamodb:PutItem and dynamodb:UpdateItem on StackFlow_Incident
- SES: At least one verified sender identity in us-east-1 for incident notification emails
- SNS Topic: stackflow-sla-alerts must exist with at least one subscription for breach notifications
- Assignment Groups: At least one StackFlow_AssignmentGroup record required for auto-routing
- SLA Definition: At least one active StackFlow_SLADefinition with priority mappings
- AI Triage (optional): StackFlow_TenantAIConfig must have triageEnabled: true and a valid AIProvider record
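The IAM requirement in the list above could be met with a policy statement along the lines of the one this sketch prints. The account ID is a placeholder, and the policy covers only the two actions named in the list; whether StackFlowAPIRole needs further permissions is not stated here.

```python
import json

# Placeholder values (assumptions): substitute your own account ID.
# The region matches the us-east-1 region used elsewhere on this page.
ACCOUNT_ID = "123456789012"
REGION = "us-east-1"


def incident_table_policy(account_id: str, region: str) -> dict:
    """Build an IAM policy granting the two DynamoDB actions listed above."""
    table_arn = f"arn:aws:dynamodb:{region}:{account_id}:table/StackFlow_Incident"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:PutItem", "dynamodb:UpdateItem"],
                "Resource": table_arn,
            }
        ],
    }


policy = incident_table_policy(ACCOUNT_ID, REGION)
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the role as an inline policy once the placeholder account ID is replaced.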
| State | Description | SLA Running |
|---|---|---|
| New | Incident created, not yet reviewed | Yes |
| In Progress | Assigned and being actively worked | Yes |
| Pending Customer | Awaiting customer information | Paused (if configured) |
| Pending Vendor | Escalated to third-party vendor | Yes |
| Resolved | Fix applied, awaiting confirmation | No |
| Closed | Confirmed resolved, no further action | No |
| Cancelled | Determined to be not a valid incident | No |
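The SLA column of the table can be expressed as a small lookup, which is handy when writing reports or tests against incident data. This is a minimal sketch: the `pause_on_pending_customer` flag is a hypothetical stand-in for the "if configured" setting, and it assumes the clock keeps running in Pending Customer when pausing is not configured.

```python
# SLA clock behaviour per lifecycle state, transcribed from the table above.
# "configurable" marks the one state whose behaviour depends on tenant settings.
SLA_BEHAVIOUR = {
    "New": "running",
    "In Progress": "running",
    "Pending Customer": "configurable",
    "Pending Vendor": "running",
    "Resolved": "stopped",
    "Closed": "stopped",
    "Cancelled": "stopped",
}


def sla_clock_running(state: str, pause_on_pending_customer: bool = True) -> bool:
    """Return True if the SLA clock runs while an incident is in `state`.

    `pause_on_pending_customer` is a hypothetical flag, not a documented
    StackFlow setting name.
    """
    behaviour = SLA_BEHAVIOUR[state]  # raises KeyError for unknown states
    if behaviour == "configurable":
        # Assumption: without the pause configured, the clock keeps running.
        return not pause_on_pending_customer
    return behaviour == "running"


print(sla_clock_running("Pending Vendor"))    # True
print(sla_clock_running("Pending Customer"))  # False
print(sla_clock_running("Pending Customer", pause_on_pending_customer=False))  # True
```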
Creating Incidents
Incidents can be created via the portal UI, the REST API, email-to-ticket, or automated alerting integrations. When AI triage is enabled, each new incident triggers the AI classification engine, which suggests a category, subcategory, assignment group, and priority based on the short description and body text.
curl -X POST https://your-instance.stackflow-tech.com/prod/api/incidents \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "short_description": "Production Aurora database connection timeout",
    "description": "Users are unable to log in. Seeing connection timeout errors in Lambda logs.",
    "priority": "P1",
    "category": "database",
    "subcategory": "availability",
    "affected_ci": "ci_aurora_main_prod",
    "reporter_id": "usr_abc123"
  }'
Priority Matrix
Incident priority is determined by the impact (number of users/services affected) and urgency (time sensitivity). The priority matrix below maps these two dimensions to P1–P4.
| Impact \ Urgency | Critical | High | Medium | Low |
|---|---|---|---|---|
| Enterprise-wide | P1 | P1 | P2 | P2 |
| Multiple departments | P1 | P2 | P2 | P3 |
| Single department | P2 | P2 | P3 | P3 |
| Single user | P2 | P3 | P3 | P4 |
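The matrix lookup can be sketched as a dictionary keyed by impact, with one priority per urgency column. The values are copied from the table above; the function and constant names are illustrative only and are not part of any StackFlow API.

```python
# Urgency columns in the same left-to-right order as the table above.
URGENCIES = ("Critical", "High", "Medium", "Low")

# One row per impact level; each tuple lines up with URGENCIES.
PRIORITY_MATRIX = {
    "Enterprise-wide":      ("P1", "P1", "P2", "P2"),
    "Multiple departments": ("P1", "P2", "P2", "P3"),
    "Single department":    ("P2", "P2", "P3", "P3"),
    "Single user":          ("P2", "P3", "P3", "P4"),
}


def priority_for(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair.

    Raises KeyError for an unknown impact and ValueError for an unknown urgency.
    """
    row = PRIORITY_MATRIX[impact]
    return row[URGENCIES.index(urgency)]


print(priority_for("Multiple departments", "Medium"))  # P2
print(priority_for("Single user", "Low"))              # P4
```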
AI-Assisted Triage
The AI triage engine analyzes the incident's text content against the Bedrock Knowledge Base (BXJGG7PIPS) to suggest a resolution path. For common issues, it can attach a resolution note directly from the knowledge base. For novel issues, it provides three ranked suggested solutions with confidence scores.
{
  "ai_triage_result": {
    "category_suggestion": "database",
    "subcategory_suggestion": "connectivity",
    "priority_suggestion": "P1",
    "assignment_group_suggestion": "Platform Engineering",
    "confidence": 0.92,
    "kb_articles": [
      {"id": "KB0001234", "title": "Aurora Connection Pool Exhaustion", "relevance": 0.95},
      {"id": "KB0001156", "title": "Lambda to Aurora VPC Connectivity", "relevance": 0.88}
    ],
    "suggested_resolution": "Check Aurora connection pool utilization. Current max_connections for db.r6g.xlarge is 2500. Run: SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
  }
}
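A client consuming this payload might accept the suggestions automatically only when confidence is high, and surface only the most relevant knowledge base articles. The sketch below shows one way to do that; both thresholds and the `summarize_triage` helper are hypothetical choices, not StackFlow defaults.

```python
import json

# Hypothetical thresholds (assumptions); tune them to your own tolerance.
AUTO_APPLY_CONFIDENCE = 0.90
MIN_KB_RELEVANCE = 0.90


def summarize_triage(payload: dict) -> dict:
    """Reduce a triage payload to the fields an agent-facing UI might need."""
    result = payload["ai_triage_result"]
    # Keep only articles at or above the relevance threshold, best first.
    articles = sorted(
        (a for a in result.get("kb_articles", []) if a["relevance"] >= MIN_KB_RELEVANCE),
        key=lambda a: a["relevance"],
        reverse=True,
    )
    return {
        "auto_apply": result["confidence"] >= AUTO_APPLY_CONFIDENCE,
        "priority": result["priority_suggestion"],
        "assignment_group": result["assignment_group_suggestion"],
        "kb_article_ids": [a["id"] for a in articles],
    }


# Abbreviated copy of the sample payload shown above.
sample = json.loads("""
{
  "ai_triage_result": {
    "priority_suggestion": "P1",
    "assignment_group_suggestion": "Platform Engineering",
    "confidence": 0.92,
    "kb_articles": [
      {"id": "KB0001234", "title": "Aurora Connection Pool Exhaustion", "relevance": 0.95},
      {"id": "KB0001156", "title": "Lambda to Aurora VPC Connectivity", "relevance": 0.88}
    ]
  }
}
""")

summary = summarize_triage(sample)
print(summary)
```

With these thresholds, the sample is marked for auto-apply and only KB0001234 is kept, because the second article's relevance of 0.88 falls below the cutoff.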
Resolving Incidents
When an incident is resolved, the resolving agent enters a resolution note, root cause category, and time spent. StackFlow automatically links the resolved incident to any related Problem records and updates the Known Error Database (KEDB) if a known error entry was involved. The resolution note is fed back into the Bedrock Knowledge Base for future RAG retrieval.
After resolution, the affected user receives an email notification with a satisfaction survey. Survey responses contribute to the AI model's feedback loop through the Exemplar Learning system, improving future resolution suggestions for similar incidents.
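Because resolution requires a note, a root cause category, and time spent, an integration could check those three inputs before submitting them. The sketch below is an illustration only: the field names `resolution_note`, `root_cause_category`, and `time_spent_minutes` are hypothetical and may not match the actual StackFlow schema.

```python
# Hypothetical field names (assumptions), standing in for the three inputs
# the resolving agent provides.
REQUIRED_FIELDS = ("resolution_note", "root_cause_category", "time_spent_minutes")


def validate_resolution(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload looks complete."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = payload.get(field)
        # Treat absent values and blank strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"missing {field}")
    minutes = payload.get("time_spent_minutes")
    is_number = isinstance(minutes, (int, float)) and not isinstance(minutes, bool)
    if is_number and minutes <= 0:
        problems.append("time_spent_minutes must be positive")
    return problems


complete = {
    "resolution_note": "Raised the connection pool limit and recycled idle sessions.",
    "root_cause_category": "capacity",
    "time_spent_minutes": 45,
}
print(validate_resolution(complete))  # []
print(validate_resolution({"resolution_note": " ", "time_spent_minutes": 0}))
```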
API Examples
# Create an incident via REST API
curl -X POST https://demo2.stackflow-tech.com/prod/api/incidents \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "short_description": "Aurora RDS connection pool exhausted in prod",
    "priority": "P1",
    "category": "database",
    "subcategory": "aurora",
    "assignment_group": "Platform Engineering",
    "description": "Lambda functions receiving ECONNREFUSED from Aurora. pg_stat_activity shows 900+ connections.",
    "cmdb_ci": "ci-aurora-main-prod",
    "notify_on_create": true
  }'
# Expected: {"incidentId": "INC0023456", "number": "INC0023456", "state": "open", "priority": "P1", "slaDeadline": "2026-05-18T03:15:00Z"}
#!/usr/bin/env python3
"""Bulk-close stale incidents that have been open more than 30 days with no updates."""
import boto3
from datetime import datetime, timedelta, timezone

ddb = boto3.resource('dynamodb', region_name='us-east-1')
table = ddb.Table('StackFlow_Incident')

# The string comparison below is chronological only if updatedAt is stored
# as an ISO 8601 UTC timestamp in the same format as this cutoff.
cutoff = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()

scan_kwargs = {
    'FilterExpression': '#s IN (:open, :in_progress) AND updatedAt < :cutoff',
    # "state" is aliased as #s to avoid a reserved-word conflict
    'ExpressionAttributeNames': {'#s': 'state'},
    'ExpressionAttributeValues': {
        ':open': 'open', ':in_progress': 'in_progress', ':cutoff': cutoff
    },
}

# A single scan call returns at most 1 MB of data, so follow
# LastEvaluatedKey until every page has been read.
stale = []
while True:
    response = table.scan(**scan_kwargs)
    stale.extend(response['Items'])
    if 'LastEvaluatedKey' not in response:
        break
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

print(f"Found {len(stale)} stale incidents to close")

for incident in stale:
    table.update_item(
        Key={'incidentId': incident['incidentId']},
        UpdateExpression='SET #s = :closed, closedAt = :now, closeCode = :auto',
        ExpressionAttributeNames={'#s': 'state'},
        ExpressionAttributeValues={
            ':closed': 'closed',
            ':now': datetime.now(timezone.utc).isoformat(),
            ':auto': 'auto_closed_stale'
        }
    )
    print(f"  Closed: {incident['incidentId']} ({incident.get('short_description','')[:60]})")

print(f"Done. {len(stale)} incidents closed.")
# Scan DynamoDB for open P1 incidents and list them oldest first
aws dynamodb scan \
  --table-name StackFlow_Incident \
  --filter-expression '#s = :open AND priority = :p1' \
  --expression-attribute-names '{"#s": "state"}' \
  --expression-attribute-values '{":open": {"S": "open"}, ":p1": {"S": "P1"}}' \
  --projection-expression 'incidentId, short_description, createdAt, assignedTo, slaDeadline' \
  --region us-east-1 | python3 -c "
import json, sys
data = json.load(sys.stdin)
# Only single quotes are used inside this script so the shell's double-quoted string stays intact
for i in sorted(data['Items'], key=lambda x: x['createdAt']['S']):
    desc = i.get('short_description', {}).get('S', 'N/A')[:60]
    print('  ' + i['incidentId']['S'] + ': ' + desc)
print('\nTotal open P1s:', data['Count'])"