v2026.1

Incident Management

Incident Lifecycle

StackFlow implements the ITIL v4 incident management process, tracking each incident through a defined lifecycle from initial detection to formal closure. The lifecycle states map directly to SLA pause/resume rules and notification triggers.

⚙️ Minimum Requirements
  • IAM Role: StackFlowAPIRole must have dynamodb:PutItem, dynamodb:UpdateItem on StackFlow_Incident
  • SES: At least one verified sender identity in us-east-1 for incident notification emails
  • SNS Topic: stackflow-sla-alerts must exist with at least one subscription for breach notifications
  • Assignment Groups: At least one StackFlow_AssignmentGroup record required for auto-routing
  • SLA Definition: At least one active StackFlow_SLADefinition with priority mappings
  • AI Triage (optional): StackFlow_TenantAIConfig must have triageEnabled: true and a valid AIProvider record

State             | Description                              | SLA Running
New               | Incident created, not yet reviewed       | Yes
In Progress       | Assigned and being actively worked       | Yes
Pending Customer  | Awaiting customer information            | Paused (if configured)
Pending Vendor    | Escalated to third-party vendor          | Yes
Resolved          | Fix applied, awaiting confirmation       | No
Closed            | Confirmed resolved, no further action    | No
Cancelled         | Determined to be not a valid incident    | No
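
The SLA Running column can be read as a per-state clock rule. The following is a minimal Python sketch of that reading, not StackFlow's implementation. It assumes a hypothetical history of (state, entered_at) transitions and a hypothetical pause_on_pending_customer flag that stands in for the "if configured" setting.

```python
from datetime import datetime, timedelta, timezone

# States in which the SLA clock always runs, per the lifecycle table.
# "Pending Customer" is handled separately because pausing is configurable.
SLA_RUNNING_STATES = {"New", "In Progress", "Pending Vendor"}


def sla_running(state: str, pause_on_pending_customer: bool = True) -> bool:
    """Return True if the SLA clock runs while an incident is in `state`."""
    if state == "Pending Customer":
        return not pause_on_pending_customer
    return state in SLA_RUNNING_STATES


def sla_elapsed(transitions, now, pause_on_pending_customer=True) -> timedelta:
    """Sum the time spent in SLA-running states.

    `transitions` is a hypothetical, chronologically ordered list of
    (state, entered_at) tuples; `now` closes the final interval.
    """
    total = timedelta()
    following = transitions[1:] + [(None, now)]
    for (state, start), (_, end) in zip(transitions, following):
        if sla_running(state, pause_on_pending_customer):
            total += end - start
    return total


t0 = datetime(2026, 5, 18, 0, 0, tzinfo=timezone.utc)
history = [
    ("New", t0),
    ("In Progress", t0 + timedelta(minutes=10)),
    ("Pending Customer", t0 + timedelta(minutes=40)),
    ("In Progress", t0 + timedelta(minutes=100)),
    ("Resolved", t0 + timedelta(minutes=130)),
]
now = t0 + timedelta(minutes=200)
elapsed = sla_elapsed(history, now)
print(elapsed)  # 1:10:00 (the 60 minutes in Pending Customer are excluded)
```

With pause_on_pending_customer=False the same history counts the full 130 minutes before resolution.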

Creating Incidents

Incidents can be created via the portal UI, REST API, email-to-ticket, or automated alerting integrations. When AI triage is enabled, every new incident triggers the AI classification engine, which suggests a category, subcategory, assignment group, and priority based on the short description and body text.

curl -X POST https://your-instance.stackflow-tech.com/prod/api/incidents \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "short_description": "Production Aurora database connection timeout",
    "description": "Users are unable to log in. Seeing connection timeout errors in Lambda logs.",
    "priority": "P1",
    "category": "database",
    "subcategory": "availability",
    "affected_ci": "ci_aurora_main_prod",
    "reporter_id": "usr_abc123"
  }'

Priority Matrix

Incident priority is determined by the impact (number of users/services affected) and urgency (time sensitivity). The priority matrix below maps these two dimensions to P1–P4.

Impact \ Urgency      | Critical | High | Medium | Low
Enterprise-wide       | P1       | P1   | P2     | P2
Multiple departments  | P1       | P2   | P2     | P3
Single department     | P2       | P2   | P3     | P3
Single user           | P2       | P3   | P3     | P4

P1 Auto-Escalation: Any P1 incident that is not acknowledged within 15 minutes of creation automatically triggers the Major Incident Management workflow. The on-call engineer receives a PagerDuty-style notification via the configured escalation policy.
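
The matrix and the 15-minute acknowledgement rule can be expressed as a lookup plus a deadline check. This is an illustrative Python sketch only; the names priority_for and needs_major_incident are hypothetical and not part of the StackFlow API.

```python
from datetime import datetime, timedelta, timezone

URGENCIES = ("Critical", "High", "Medium", "Low")

# Rows copied from the priority matrix; tuple positions follow URGENCIES.
PRIORITY_MATRIX = {
    "Enterprise-wide":      ("P1", "P1", "P2", "P2"),
    "Multiple departments": ("P1", "P2", "P2", "P3"),
    "Single department":    ("P2", "P2", "P3", "P3"),
    "Single user":          ("P2", "P3", "P3", "P4"),
}

P1_ACK_WINDOW = timedelta(minutes=15)


def priority_for(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair."""
    return PRIORITY_MATRIX[impact][URGENCIES.index(urgency)]


def needs_major_incident(priority, created_at, acknowledged_at, now) -> bool:
    """True when a P1 was not acknowledged within 15 minutes of creation."""
    if priority != "P1":
        return False
    deadline = created_at + P1_ACK_WINDOW
    if acknowledged_at is not None:
        # A late acknowledgement still means the window was missed.
        return acknowledged_at > deadline
    return now > deadline


created = datetime(2026, 5, 18, 3, 0, tzinfo=timezone.utc)
print(priority_for("Multiple departments", "Critical"))  # P1
print(needs_major_incident("P1", created, None, created + timedelta(minutes=16)))  # True
```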

AI-Assisted Triage

The AI triage engine analyzes the incident's text content against the Bedrock Knowledge Base (BXJGG7PIPS) to suggest a resolution path. For common issues, it can attach a resolution note directly from the knowledge base. For novel issues, it provides three ranked suggested solutions with confidence scores.

{
  "ai_triage_result": {
    "category_suggestion": "database",
    "subcategory_suggestion": "connectivity",
    "priority_suggestion": "P1",
    "assignment_group_suggestion": "Platform Engineering",
    "confidence": 0.92,
    "kb_articles": [
      {"id": "KB0001234", "title": "Aurora Connection Pool Exhaustion", "relevance": 0.95},
      {"id": "KB0001156", "title": "Lambda to Aurora VPC Connectivity", "relevance": 0.88}
    ],
    "suggested_resolution": "Check Aurora connection pool utilization. Current max_connections for db.r6g.xlarge is 2500. Run: SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
  }
}
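
One way a client might consume this payload is to accept the suggestions only above a confidence cutoff and only for fields the reporter left empty. The sketch below assumes exactly that policy. The 0.90 threshold and the apply_triage helper are hypothetical, and StackFlow's own handling may differ.

```python
AUTO_APPLY_THRESHOLD = 0.90  # hypothetical cutoff, not a StackFlow default


def apply_triage(incident: dict, triage: dict,
                 threshold: float = AUTO_APPLY_THRESHOLD) -> dict:
    """Return a copy of `incident` with AI suggestions merged in.

    Suggestions are applied only when the overall confidence meets
    `threshold`, and only to fields that are missing or empty.
    """
    result = triage["ai_triage_result"]
    updated = dict(incident)
    if result["confidence"] >= threshold:
        for field in ("category", "subcategory", "priority", "assignment_group"):
            suggestion = result.get(f"{field}_suggestion")
            if suggestion and not updated.get(field):
                updated[field] = suggestion
    # Always attach KB article IDs, most relevant first, for agent review.
    articles = sorted(result.get("kb_articles", []),
                      key=lambda a: a["relevance"], reverse=True)
    updated["kb_articles"] = [a["id"] for a in articles]
    return updated


triage = {"ai_triage_result": {
    "category_suggestion": "database",
    "subcategory_suggestion": "connectivity",
    "priority_suggestion": "P1",
    "assignment_group_suggestion": "Platform Engineering",
    "confidence": 0.92,
    "kb_articles": [
        {"id": "KB0001156", "title": "Lambda to Aurora VPC Connectivity", "relevance": 0.88},
        {"id": "KB0001234", "title": "Aurora Connection Pool Exhaustion", "relevance": 0.95},
    ],
}}
incident = {"short_description": "Production Aurora database connection timeout",
            "category": "database", "priority": None}
updated = apply_triage(incident, triage)
print(updated["priority"], updated["assignment_group"], updated["kb_articles"])
```

Returning a copy keeps the original record intact, so an agent can compare the reported values with the suggested ones before saving.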

Resolving Incidents

When an incident is resolved, the resolving agent enters a resolution note, root cause category, and time spent. StackFlow automatically links the resolved incident to any related Problem records and updates the Known Error Database (KEDB) if a known error entry was involved. The resolution note is fed back into the Bedrock Knowledge Base for future RAG retrieval.

After resolution, the affected user receives an email notification with a satisfaction survey. Survey responses contribute to the AI model's feedback loop through the Exemplar Learning system, improving future resolution suggestions for similar incidents.

API Examples

# Create an incident via REST API
curl -X POST https://demo2.stackflow-tech.com/prod/api/incidents \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "short_description": "Aurora RDS connection pool exhausted in prod",
    "priority": "P1",
    "category": "database",
    "subcategory": "aurora",
    "assignment_group": "Platform Engineering",
    "description": "Lambda functions receiving ECONNREFUSED from Aurora. pg_stat_activity shows 900+ connections.",
    "cmdb_ci": "ci-aurora-main-prod",
    "notify_on_create": true
  }'

# Expected: {"incidentId": "INC0023456", "number": "INC0023456", "state": "open", "priority": "P1", "slaDeadline": "2026-05-18T03:15:00Z"}

#!/usr/bin/env python3
"""Bulk-close stale incidents that have been open more than 30 days with no updates."""
import boto3
from datetime import datetime, timedelta, timezone

ddb = boto3.resource('dynamodb', region_name='us-east-1')
table = ddb.Table('StackFlow_Incident')

cutoff = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()

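# A single Scan call returns at most 1 MB of data, so follow
# LastEvaluatedKey until every page has been read.
scan_kwargs = {
    'FilterExpression': '#s IN (:open, :in_progress) AND updatedAt < :cutoff',
    'ExpressionAttributeNames': {'#s': 'state'},  # "state" is a DynamoDB reserved word
    'ExpressionAttributeValues': {
        ':open': 'open', ':in_progress': 'in_progress', ':cutoff': cutoff
    },
}
stale = []
while True:
    response = table.scan(**scan_kwargs)
    stale.extend(response['Items'])
    if 'LastEvaluatedKey' not in response:
        break
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
print(f"Found {len(stale)} stale incidents to close")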

for incident in stale:
    table.update_item(
        Key={'incidentId': incident['incidentId']},
        UpdateExpression='SET #s = :closed, closedAt = :now, closeCode = :auto',
        ExpressionAttributeNames={'#s': 'state'},
        ExpressionAttributeValues={
            ':closed': 'closed',
            ':now': datetime.now(timezone.utc).isoformat(),
            ':auto': 'auto_closed_stale'
        }
    )
    print(f"  Closed: {incident['incidentId']} ({incident.get('short_description','')[:60]})")

print(f"Done. {len(stale)} incidents closed.")

# Query DynamoDB for open P1 incidents
aws dynamodb scan \
  --table-name StackFlow_Incident \
  --filter-expression '#s = :open AND priority = :p1' \
  --expression-attribute-names '{"#s": "state"}' \
  --expression-attribute-values '{":open": {"S": "open"}, ":p1": {"S": "P1"}}' \
  --projection-expression 'incidentId, short_description, createdAt, assignedTo, slaDeadline' \
  --region us-east-1 | python3 -c '
# The script is wrapped in single quotes for the shell, so it uses only double quotes.
import json, sys
data = json.load(sys.stdin)
for item in sorted(data["Items"], key=lambda x: x["createdAt"]["S"]):
    incident_id = item["incidentId"]["S"]
    summary = item.get("short_description", {}).get("S", "N/A")[:60]
    print(f"  {incident_id}: {summary}")
print("\nTotal open P1s:", data["Count"])
'