v2026.1

Auto-Remediation

Remediation Architecture

StackFlow's auto-remediation system uses SQS FIFO queues to process remediation actions with strict ordering (per message group) and exactly-once processing (duplicate messages sent within the 5-minute deduplication interval are discarded). When an incident triggers a remediation workflow, the action is enqueued to either StackFlow-Remediation-P1.fifo (for critical-priority actions) or StackFlow-Remediation-Standard.fifo (for all other actions). Failed remediations are moved to StackFlow-Remediation-DLQ.fifo for manual review.
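The routing described above can be sketched as follows. This is a minimal illustration, not StackFlow's actual producer: the queue names come from the text, while `select_queue`, `build_send_params`, the message body layout, and the choice of the incident ID as the message group are assumptions. The returned dictionary has the shape of the keyword arguments accepted by boto3's `sqs.send_message()`.

```python
import json

# Queue names are taken from the text above; everything else is illustrative.
P1_QUEUE = "StackFlow-Remediation-P1.fifo"
STANDARD_QUEUE = "StackFlow-Remediation-Standard.fifo"


def select_queue(priority: str) -> str:
    """Route critical (P1) actions to the P1 queue, everything else to Standard."""
    return P1_QUEUE if priority.upper() == "P1" else STANDARD_QUEUE


def build_send_params(queue_url_prefix: str, incident_id: str,
                      priority: str, action: dict) -> dict:
    """Build keyword arguments in the shape expected by sqs.send_message()."""
    body = json.dumps({"incident_id": incident_id, "action": action}, sort_keys=True)
    return {
        "QueueUrl": f"{queue_url_prefix}/{select_queue(priority)}",
        "MessageBody": body,
        # FIFO ordering applies within a message group. Grouping by incident
        # is an assumption; it keeps actions for one incident in order.
        "MessageGroupId": incident_id,
        # A MessageDeduplicationId is also required unless content-based
        # deduplication is enabled on the queue.
    }
```

Grouping by incident lets unrelated incidents be processed in parallel while actions for the same incident stay ordered.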

⚙️ Minimum Requirements
  • SQS Queues: StackFlow-Remediation-P1.fifo and StackFlow-Remediation-Standard.fifo both provisioned and subscribed to SNS alert topics
  • Lambda: StackFlowRemediationEngine with SQS event source mappings on both queues
  • DynamoDB: StackFlow_RemediationPlaybook with at least one active playbook per supported alert type
  • IAM: Remediation Lambda role must have permissions to execute playbook actions (EC2 reboot, Lambda invoke, RDS failover, etc.)

P1 Incident Created
    │
    ▼
AI Classifier (category: database, confidence: 0.94)
    │
    ▼
Remediation Rule Match? (category=database AND confidence > 0.85)
    │
    ▼
Enqueue to StackFlow-Remediation-P1.fifo
    │
    ▼
StackFlowRemediationEngine Lambda (consumer)
    │
    ├─ Check safety controls (maintenance window? recent changes?)
    │
    ├─ Execute: SSM Run Command / Lambda invoke / API call
    │
    └─ Update incident work note with result
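
The rule-match step in the flow above (category=database AND confidence > 0.85) could look roughly like this. The `Classification` and `RemediationRule` types and the `match_rule` function are hypothetical names introduced for illustration; only the category, the 0.85 threshold, and the queue name come from the diagram.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Classification:
    """Output of the AI classifier (field names are assumptions)."""
    category: str
    confidence: float


@dataclass(frozen=True)
class RemediationRule:
    """A rule that maps a classification to a target queue (hypothetical shape)."""
    category: str
    min_confidence: float
    queue: str


def match_rule(classification: Classification,
               rules: Iterable[RemediationRule]) -> Optional[RemediationRule]:
    """Return the first rule whose category matches and whose threshold is exceeded."""
    for rule in rules:
        # The diagram uses a strict comparison (confidence > 0.85), so a
        # confidence exactly equal to the threshold does not match.
        if (classification.category == rule.category
                and classification.confidence > rule.min_confidence):
            return rule
    return None


# Example rule mirroring the diagram.
RULES = [RemediationRule("database", 0.85, "StackFlow-Remediation-P1.fifo")]
```

With these rules, a classification of `database` at 0.94 confidence matches, while one at exactly 0.85 or in another category does not and no remediation is enqueued.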

P1 Remediation Queue

The P1 queue (StackFlow-Remediation-P1.fifo) has a higher concurrency limit and shorter visibility timeout than the Standard queue, ensuring critical remediations are processed within seconds of being triggered. Messages include the incident ID, remediation action spec, and authentication context for the Lambda executor.
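
A consumer for these messages might be structured as below. This is a sketch under assumptions: the message body is taken to be JSON with `incident_id` and `action` keys, `execute_remediation` is a hypothetical placeholder for the real executor, and the event source mapping is assumed to have `ReportBatchItemFailures` enabled so that Lambda honors the `batchItemFailures` response. On a FIFO queue, processing stops at the first failure and that record plus all later ones are reported as failed, which preserves ordering.

```python
import json


def execute_remediation(message: dict) -> None:
    """Hypothetical placeholder for the real executor; raises on failure."""
    if "incident_id" not in message or "action" not in message:
        raise ValueError("message is missing incident_id or action")


def handler(event, context=None, execute=execute_remediation):
    """Process an SQS batch and report failures using a partial batch response."""
    failures = []
    records = event.get("Records", [])
    for index, record in enumerate(records):
        try:
            execute(json.loads(record["body"]))
        except Exception:
            # FIFO queue: stop at the first failure and return this record and
            # every unprocessed record after it, so later messages are not
            # handled ahead of the failed one.
            failures = [{"itemIdentifier": r["messageId"]} for r in records[index:]]
            break
    return {"batchItemFailures": failures}
```

Messages reported as failed become visible again after the visibility timeout and, after enough failed receives, move to the dead-letter queue according to the queue's redrive policy.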

Idempotency: All remediation actions must be idempotent, that is, safe to run more than once if retried. Set an SQS message deduplication ID so that a duplicate send within the 5-minute deduplication interval is discarded, and implement idempotency checks at the action level (e.g., check whether a service is already running before attempting to start it), because a message can still be redelivered if processing fails or the visibility timeout expires.
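
Both layers of idempotency can be illustrated with a short sketch. The function names and the choice of a SHA-256 hash over the incident ID and action spec are assumptions made for this example; the resulting 64-character hex string fits within the 128-character limit for an SQS `MessageDeduplicationId`.

```python
import hashlib
import json
from typing import Callable


def deduplication_id(incident_id: str, action: dict) -> str:
    """Derive a stable deduplication ID from the incident and action spec."""
    # Canonical JSON (sorted keys, fixed separators) makes the hash independent
    # of dictionary key order.
    canonical = json.dumps({"incident_id": incident_id, "action": action},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def start_service_if_stopped(get_state: Callable[[], str],
                             start: Callable[[], None]) -> bool:
    """Action-level check: only start the service if it is not already running.

    Returns True if a start was issued, False if the action was a no-op.
    """
    if get_state() == "running":
        return False
    start()
    return True
```

The deduplication ID protects against the same action being enqueued twice, while the action-level check makes a redelivered message harmless.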

Remediation Actions

Action                  | Description                        | Required AWS Permission
SSM Run Command         | Run scripts on EC2 instances       | ssm:SendCommand
Lambda Invoke           | Invoke a custom remediation Lambda | lambda:InvokeFunction
ECS Service Update      | Force new deployment / stop task   | ecs:UpdateService
RDS Reboot              | Reboot an RDS/Aurora instance      | rds:RebootDBInstance
CloudFront Invalidation | Invalidate CDN cache               | cloudfront:CreateInvalidation
Scale ASG               | Adjust Auto Scaling Group capacity | autoscaling:SetDesiredCapacity
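
As an illustration of how the table maps onto the remediation Lambda's IAM role, the sketch below builds a policy document that grants only the permissions for the actions a deployment enables. The dictionary keys (such as `ssm_run_command`) and the `build_policy` helper are hypothetical; the permission strings are those listed in the table. The wildcard resource is a placeholder and should be narrowed to specific ARNs in practice.

```python
# Permission strings come from the table above; the keys are illustrative.
ACTION_PERMISSIONS = {
    "ssm_run_command": "ssm:SendCommand",
    "lambda_invoke": "lambda:InvokeFunction",
    "ecs_service_update": "ecs:UpdateService",
    "rds_reboot": "rds:RebootDBInstance",
    "cloudfront_invalidation": "cloudfront:CreateInvalidation",
    "scale_asg": "autoscaling:SetDesiredCapacity",
}


def build_policy(enabled_actions, resource="*"):
    """Return an IAM policy document allowing only the enabled remediation actions."""
    # A KeyError here signals an action type that is not in the table.
    permissions = sorted({ACTION_PERMISSIONS[name] for name in enabled_actions})
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": permissions, "Resource": resource}
        ],
    }
```

Granting only the permissions for enabled playbook actions keeps the role aligned with least privilege.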

Safety Controls

Auto-remediation safety controls prevent accidental or harmful automated actions. Controls include:
  • Maintenance window exclusion: do not remediate if a maintenance window is active for the affected CI
  • Recent-change exclusion: do not remediate if a change was implemented in the last 30 minutes for the affected CI
  • Confidence threshold: minimum AI confidence required to trigger automated action
  • Dry-run mode: log the action without executing it
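
The four controls can be evaluated in sequence before any action runs. The sketch below is one possible arrangement, not the engine's actual logic: `SafetyContext`, `evaluate`, the 0.85 default threshold, and the order of the checks are assumptions, and whether the threshold comparison is inclusive is also assumed. The 30-minute recent-change window comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple

RECENT_CHANGE_WINDOW = timedelta(minutes=30)  # from the text above


@dataclass
class SafetyContext:
    """Inputs to the safety checks for one affected CI (hypothetical shape)."""
    in_maintenance_window: bool
    last_change_at: Optional[datetime]
    confidence: float
    min_confidence: float = 0.85  # assumed default threshold
    dry_run: bool = False


def evaluate(ctx: SafetyContext, now: datetime) -> Tuple[str, str]:
    """Return (decision, reason), where decision is 'execute', 'skip', or 'dry_run'."""
    if ctx.in_maintenance_window:
        return ("skip", "maintenance window active")
    if ctx.last_change_at is not None and now - ctx.last_change_at < RECENT_CHANGE_WINDOW:
        return ("skip", "change implemented in the last 30 minutes")
    if ctx.confidence < ctx.min_confidence:
        return ("skip", "confidence below threshold")
    if ctx.dry_run:
        # Dry-run is checked last so the log shows an action that would
        # otherwise have executed.
        return ("dry_run", "action logged but not executed")
    return ("execute", "all safety controls passed")
```

Returning a reason alongside the decision makes it straightforward to record why a remediation was skipped.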

Audit Trail

Every auto-remediation action is fully audited. The audit record includes the triggering incident ID, remediation workflow version, action type, target resource, execution result (success, failure, or skipped), IAM identity used, CloudTrail event ID (for AWS API actions), and the SSM Run Command output (for SSM actions). Audit records are stored in Aurora for 7 years for compliance purposes.
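
The fields listed above could be modeled as a record like the following. This is a hedged sketch: the class name, field names, and the `retention_expiry` helper are assumptions and do not reflect the actual Aurora schema. The result values and the 7-year retention period come from the text.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional

VALID_RESULTS = {"success", "failure", "skipped"}


@dataclass(frozen=True)
class AuditRecord:
    """One audit entry per remediation action (field names are illustrative)."""
    incident_id: str
    workflow_version: str
    action_type: str
    target_resource: str
    result: str
    iam_identity: str
    cloudtrail_event_id: Optional[str]  # only set for AWS API actions
    ssm_command_output: Optional[str]   # only set for SSM Run Command actions
    recorded_at: datetime

    def __post_init__(self):
        if self.result not in VALID_RESULTS:
            raise ValueError(f"unknown result: {self.result!r}")


def retention_expiry(recorded_at: datetime, years: int = 7) -> datetime:
    """Return the date after which a record falls outside the retention period."""
    try:
        return recorded_at.replace(year=recorded_at.year + years)
    except ValueError:
        # February 29 in a year whose target year is not a leap year.
        return recorded_at.replace(year=recorded_at.year + years, day=28)
```

Calling `asdict(record)` yields a plain dictionary that can be passed to whatever persistence layer writes the audit table.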