
Cloud Fleet Health

Fleet Overview

The Cloud Fleet Health dashboard provides a real-time view of infrastructure health across all connected cloud accounts. Resources are organized by account, region, and service type. Health indicators use a traffic-light system (green/yellow/red) based on configurable threshold rules applied to CloudWatch, Azure Monitor, and GCP Cloud Monitoring metrics.

⚙️ Minimum Requirements
  • CloudWatch: Cross-account CloudWatch metrics sharing enabled from connected accounts to account 373544523367
  • Lambda: StackFlowFleetHealthCollector running on a 5-minute EventBridge schedule
  • DynamoDB: StackFlow_FleetHealthSnapshot table for historical health state
  • SNS: stackflow-fleet-alerts topic for health degradation notifications

The fleet health view is the first place on-call engineers check during a P1 incident investigation. It shows which resources are currently degraded, what metrics triggered the degradation status, and links directly to the relevant cloud console for immediate investigation.

Health Metrics

| Resource Type | Key Metrics | Default Thresholds |
|---|---|---|
| EC2 / VM | CPU utilization, memory, disk I/O | Warning: 80% / Critical: 95% |
| RDS / Aurora | CPU, connections, storage, replication lag | Warning: 75% CPU / Critical: 90% |
| Lambda | Error rate, throttles, duration, concurrent executions | Warning: 1% errors / Critical: 5% |
| ElastiCache | CPU, memory, connections, evictions | Warning: 70% memory / Critical: 85% |
| API Gateway | 4XX/5XX rate, latency, throttle count | Warning: 1% 5XX / Critical: 5% |
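
As an illustration of how the default thresholds above could map onto the green/yellow/red indicators, here is a minimal Python sketch. The rule keys, the `Threshold` class, and the `classify` function are hypothetical names, not part of StackFlow. The sketch also assumes that a value equal to a threshold counts as breaching it, which the table does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float   # percent at which the indicator turns yellow
    critical: float  # percent at which the indicator turns red


# Values copied from the default thresholds table; the keys are hypothetical.
# The table does not say which EC2 / VM metric the 80% / 95% pair applies to,
# so that entry is keyed by resource type only.
DEFAULT_THRESHOLDS = {
    "ec2": Threshold(80.0, 95.0),
    "rds_cpu": Threshold(75.0, 90.0),
    "lambda_error_rate": Threshold(1.0, 5.0),
    "elasticache_memory": Threshold(70.0, 85.0),
    "apigw_5xx_rate": Threshold(1.0, 5.0),
}


def classify(rule: str, value_pct: float) -> str:
    """Return 'green', 'yellow', or 'red' for a metric value in percent."""
    t = DEFAULT_THRESHOLDS[rule]
    # Assumption: thresholds are inclusive (value == threshold is a breach).
    if value_pct >= t.critical:
        return "red"
    if value_pct >= t.warning:
        return "yellow"
    return "green"
```

For example, `classify("rds_cpu", 80.0)` falls between the 75% warning and 90% critical thresholds and therefore yields a yellow indicator.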

Alerting Integration

Fleet health alerts integrate with CloudWatch Alarms (AWS), Azure Monitor Alerts, and GCP Cloud Monitoring alerts. When any of these fire, StackFlow receives the alert via the configured webhook or SNS subscription and creates (or updates) an incident in the ITSM module automatically. The alert payload is stored in the incident description and used by the AI triage engine for initial classification.
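
For the SNS path, the sketch below shows one way a subscriber could unpack CloudWatch alarm notifications before an incident is created or updated. It relies on the standard shape of an SNS-triggered Lambda event (`Records[].Sns.Message` holding a JSON string) and on the usual CloudWatch alarm fields (`AlarmName`, `NewStateValue`, `NewStateReason`, `Trigger`). The function name and the output keys are hypothetical, and StackFlow's actual ingestion code may differ.

```python
import json


def parse_alarm_records(sns_event: dict) -> list[dict]:
    """Extract alarm details from an SNS-delivered CloudWatch alarm event."""
    alarms = []
    for record in sns_event.get("Records", []):
        raw = record["Sns"]["Message"]  # JSON string published by CloudWatch
        message = json.loads(raw)
        trigger = message.get("Trigger", {})
        # Dimension entries normally use lowercase "name"/"value" keys;
        # the capitalized fallback is a defensive assumption.
        dimensions = {
            d.get("name", d.get("Name")): d.get("value", d.get("Value"))
            for d in trigger.get("Dimensions", [])
        }
        alarms.append({
            "alarm_name": message.get("AlarmName"),
            "state": message.get("NewStateValue"),
            "reason": message.get("NewStateReason"),
            "metric": trigger.get("MetricName"),
            "namespace": trigger.get("Namespace"),
            "dimensions": dimensions,
            # Kept verbatim so it can be stored in the incident description.
            "raw_payload": raw,
        })
    return alarms
```

Keeping the unmodified payload alongside the parsed fields mirrors the behavior described above, where the alert payload is stored in the incident description.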

Deduplication: StackFlow deduplicates cloud alerts against existing open incidents. If a new alert fires for a configuration item (CI) that already has an open incident with a matching category, the alert is added as a work note to the existing incident instead of creating a duplicate incident.
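
The deduplication rule can be sketched as follows. This is an in-memory illustration only: the `Incident` class, its fields, and `route_alert` are hypothetical stand-ins for the ITSM module, and the match is assumed to be an exact comparison of CI and category.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    ci: str                      # configuration item the incident is about
    category: str
    is_open: bool = True
    description: str = ""
    work_notes: list[str] = field(default_factory=list)


def route_alert(incidents: list[Incident], ci: str, category: str,
                payload: str) -> Incident:
    """Attach the alert to a matching open incident, or create a new one."""
    for incident in incidents:
        # Assumption: "matching" means same CI and identical category string.
        if incident.is_open and incident.ci == ci and incident.category == category:
            incident.work_notes.append(payload)
            return incident
    # No open match: create a new incident with the payload as its description.
    created = Incident(ci=ci, category=category, description=payload)
    incidents.append(created)
    return created
```

Calling `route_alert` twice with the same CI and category leaves a single incident whose `work_notes` holds the second payload, while a closed incident or a different category leads to a new incident.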

Drill-Down View

Click any resource in the Fleet Health view to open the resource detail panel. The panel shows current metric values, 24-hour metric trend charts, open incidents for the CI, recent changes, and a Neptune graph of dependent services. It also provides quick actions: create incident, start remediation workflow, or open in cloud console.

Historical fleet health data is stored in S3 as Parquet files and is queryable via Athena for long-term trend analysis. The Fleet Health dashboard shows 7-day and 30-day trend sparklines for each resource. Full historical data export is available in the Report Studio for custom analysis periods.