Cloud Fleet Health

Fleet Overview

The Cloud Fleet Health dashboard provides a real-time view of infrastructure health across all connected cloud accounts. Resources are organized by account, region, and service type. Health indicators use a traffic-light system (green/yellow/red) based on configurable threshold rules applied to CloudWatch, Azure Monitor, and GCP Cloud Monitoring metrics.

⚙️ Minimum Requirements

CloudWatch: Cross-account CloudWatch metrics sharing enabled from connected accounts to account 373544523367
Lambda: StackFlowFleetHealthCollector on a 5-minute EventBridge schedule
DynamoDB: StackFlow_FleetHealthSnapshot table for historical health state
SNS: stackflow-fleet-alerts topic for health degradation notifications

The fleet health view is the first place on-call engineers check during a P1 incident investigation. It shows which resources are currently degraded, what metrics triggered the degradation status, and links directly to the relevant cloud console for immediate investigation.

Health Metrics

Resource Type	Key Metrics	Default Thresholds
EC2 / VM	CPU utilization, memory, disk I/O	Warning: 80% / Critical: 95%
RDS / Aurora	CPU, connections, storage, replication lag	Warning: 75% CPU / Critical: 90%
Lambda	Error rate, throttles, duration, concurrent executions	Warning: 1% errors / Critical: 5%
ElastiCache	CPU, memory, connections, evictions	Warning: 70% memory / Critical: 85%
API Gateway	4XX/5XX rate, latency, throttle count	Warning: 1% 5XX / Critical: 5%
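
The traffic-light mapping can be sketched in a few lines. This is a minimal illustration, not StackFlow's actual rule engine: the metric keys, the `Threshold` type, the choice of "greater than or equal" as the comparison, and the "worst metric wins" roll-up are all assumptions. Only the percentages come from the table above. The EC2 / VM row is left out because the table does not say which metric its thresholds apply to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Percentages are the defaults from the table; the (resource type, metric)
# keys are hypothetical names chosen for this sketch.
DEFAULT_THRESHOLDS = {
    ("rds", "cpu_percent"): Threshold(75.0, 90.0),
    ("lambda", "error_rate_percent"): Threshold(1.0, 5.0),
    ("elasticache", "memory_percent"): Threshold(70.0, 85.0),
    ("api_gateway", "5xx_rate_percent"): Threshold(1.0, 5.0),
}

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def classify(value: float, threshold: Threshold) -> str:
    """Map one metric value to green/yellow/red (assumes >= comparison)."""
    if value >= threshold.critical:
        return "red"
    if value >= threshold.warning:
        return "yellow"
    return "green"


def resource_health(resource_type, metrics, thresholds=DEFAULT_THRESHOLDS) -> str:
    """Return the worst status across all metrics that have a rule.

    Metrics without a configured rule are ignored (an assumption).
    """
    worst = "green"
    for name, value in metrics.items():
        rule = thresholds.get((resource_type, name))
        if rule is None:
            continue
        status = classify(value, rule)
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst
```

For example, `resource_health("rds", {"cpu_percent": 82.0})` evaluates to `"yellow"` under these assumptions, because 82% sits between the 75% warning and 90% critical defaults.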

Alerting Integration

Fleet health alerts integrate with CloudWatch Alarms (AWS), Azure Monitor Alerts, and GCP Cloud Monitoring alerts. When any of these fire, StackFlow receives the alert via the configured webhook or SNS subscription and creates (or updates) an incident in the ITSM module automatically. The alert payload is stored in the incident description and used by the AI triage engine for initial classification.
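
As a rough sketch of the SNS path, the function below turns a CloudWatch alarm notification into an incident draft. The field names read from the message (`AlarmName`, `NewStateValue`, `Trigger`, `Dimensions`) follow the JSON that CloudWatch alarms publish to SNS. The shape of the returned incident, the use of the first dimension value as the CI, and the use of the metric namespace as the category are assumptions made for illustration, not StackFlow's documented schema.

```python
import json


def incident_from_sns(envelope: dict) -> dict:
    """Build a hypothetical incident draft from an SNS notification envelope.

    The envelope's "Message" field holds the CloudWatch alarm as a JSON string.
    """
    message = json.loads(envelope["Message"])
    trigger = message.get("Trigger", {})
    dimensions = {d["name"]: d["value"] for d in trigger.get("Dimensions", [])}
    return {
        "short_description": f'{message["AlarmName"]} is {message["NewStateValue"]}',
        # The full payload is kept in the description, as the text above describes.
        "description": json.dumps(message, indent=2, sort_keys=True),
        # Assumption: the first dimension value identifies the CI.
        "ci": next(iter(dimensions.values()), None),
        # Assumption: the metric namespace stands in for the incident category.
        "category": trigger.get("Namespace"),
        "source": "cloudwatch",
    }
```

Azure Monitor and GCP Cloud Monitoring webhooks use different payload shapes, so each would need its own small adapter producing the same draft structure.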

Deduplication: StackFlow deduplicates cloud alerts against existing open incidents. If a new alert fires for a CI that already has an open incident with a matching category, the alert is added as a work note to that incident instead of creating a duplicate.
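
The deduplication rule can be illustrated with a small in-memory sketch. The `Incident` type, its fields, and the `ingest_alert` function are hypothetical. The only behavior taken from the text above is the match on CI plus category among open incidents, and the work-note fallback.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    ci: str
    category: str
    description: str
    state: str = "open"
    work_notes: list = field(default_factory=list)


def ingest_alert(incidents: list, ci: str, category: str, payload: str):
    """Attach the alert to a matching open incident, or create a new one.

    Returns (incident, created), where created is False when the alert
    was deduplicated into an existing incident.
    """
    for incident in incidents:
        if (
            incident.state == "open"
            and incident.ci == ci
            and incident.category == category
        ):
            incident.work_notes.append(payload)
            return incident, False
    new_incident = Incident(ci=ci, category=category, description=payload)
    incidents.append(new_incident)
    return new_incident, True
```

Under this sketch, a second alert for the same CI and category lands in `work_notes`, while an alert with a different category, or one arriving after the incident has left the open state, creates a new incident.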

Drill-Down View

Click any resource in the Fleet Health view to open the resource detail panel. The detail panel shows current metric values, 24-hour metric trend charts, open incidents for this CI, recent changes, and a Neptune graph of dependent services. The panel also provides quick actions: create an incident, start a remediation workflow, or open the resource in the cloud console.

Historical Trends

Historical fleet health data is stored in S3 as Parquet files and can be queried through Athena for long-term trend analysis. The Fleet Health dashboard shows 7-day and 30-day trend sparklines for each resource. Full historical data export is available in Report Studio for custom analysis periods.
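
A long-term trend query might look like the sketch below, which builds an Athena SQL statement counting red snapshots per day for one resource. The table name `fleet_health_history` and the columns `resource_id`, `status`, and `snapshot_time` are assumptions, since the actual Parquet schema is not documented here. The resource ID is left as a `?` placeholder to be bound by whichever Athena client is in use, which may have its own quoting rules.

```python
def build_trend_query(resource_id: str, days: int, table: str = "fleet_health_history"):
    """Return (sql, parameters) for a per-day degradation count.

    Table and column names are hypothetical placeholders.
    """
    if not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    # The table name is interpolated, so restrict it to a plain identifier.
    if not table.replace("_", "").isalnum():
        raise ValueError("table must be a plain identifier")
    sql = (
        "SELECT date_trunc('day', snapshot_time) AS day, "
        "count_if(status = 'red') AS red_snapshots, "
        "count(*) AS total_snapshots "
        f"FROM {table} "
        "WHERE resource_id = ? "
        f"AND snapshot_time >= current_timestamp - interval '{days}' day "
        "GROUP BY 1 ORDER BY 1"
    )
    return sql, [resource_id]
```

Calling `build_trend_query("i-0abc", 30)` yields a statement covering the same 30-day window as the dashboard sparkline, and a larger `days` value extends it to a custom analysis period.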
