SRE Metrics
DORA Metrics
The SRE Metrics dashboard tracks the four DORA (DevOps Research and Assessment) metrics, a widely used standard for measuring software delivery and operational performance. StackFlow calculates them from Change Management records (deployment frequency, lead time for changes) and Incident records (MTTR, change failure rate).
- CloudWatch: Custom namespace `StackFlow/DORA` with metrics: `DeploymentFrequency`, `LeadTime`, `MTTR`, `ChangeFailureRate`
- GitHub Integration: `StackFlowGitHubSync` Lambda active for deployment frequency tracking
- Aurora: `stackflow.change_records` with `deployment_outcome` field populated for change failure rate
| DORA Metric | Description | StackFlow Source | Elite Target |
|---|---|---|---|
| Deployment Frequency | How often code is deployed to production | Successful Change records tagged "deployment" | Multiple/day |
| Lead Time for Changes | Time from commit to production | Change record create → close duration | <1 hour |
| MTTR | Time to restore service after incident | Incident create → resolve duration (P1+P2) | <1 hour |
| Change Failure Rate | % of deployments causing incidents | Changes with linked P1/P2 incidents within 24h | <5% |
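The two duration-based rows in the table (Lead Time for Changes and MTTR) can be sketched as a mean over record timestamps. This is a minimal illustration, not StackFlow's actual calculation: the field names (`created`, `closed`, `resolved`, `priority`), the sample records, and the choice of a mean rather than a median are all assumptions.

```python
from datetime import datetime
from statistics import mean


def mean_duration_hours(records, start_key, end_key):
    """Mean of (end - start) in hours, skipping records missing either timestamp."""
    durations = [
        (r[end_key] - r[start_key]).total_seconds() / 3600
        for r in records
        if r.get(start_key) and r.get(end_key)
    ]
    return mean(durations) if durations else None


# Hypothetical incident records; field names are illustrative only.
incidents = [
    {"priority": "P1", "created": datetime(2024, 5, 1, 10, 0), "resolved": datetime(2024, 5, 1, 10, 30)},
    {"priority": "P2", "created": datetime(2024, 5, 2, 9, 0), "resolved": datetime(2024, 5, 2, 10, 30)},
    {"priority": "P3", "created": datetime(2024, 5, 3, 9, 0), "resolved": datetime(2024, 5, 3, 18, 0)},
]

# MTTR considers only P1 and P2 incidents, as stated in the table.
major_incidents = [i for i in incidents if i["priority"] in {"P1", "P2"}]
mttr_hours = mean_duration_hours(major_incidents, "created", "resolved")

# Hypothetical change records: lead time is the create-to-close duration.
changes = [
    {"created": datetime(2024, 5, 1, 8, 0), "closed": datetime(2024, 5, 1, 8, 45)},
    {"created": datetime(2024, 5, 1, 13, 0), "closed": datetime(2024, 5, 1, 13, 15)},
]
lead_time_hours = mean_duration_hours(changes, "created", "closed")

print(f"MTTR: {mttr_hours:.2f} h, lead time: {lead_time_hours:.2f} h")
```

In this sample the P3 incident is excluded, so both results can be compared directly against the <1 hour elite targets.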
Error Budget
The error budget tracks how much downtime remains before a service breaches its SLO (Service Level Objective) target. Configure SLOs for each critical service in SRE Metrics → SLOs. The error budget gauge shows the remaining budget both as a percentage and in absolute minutes. When the remaining error budget reaches 0%, the change management workflow automatically blocks all non-critical changes.
{
  "slo": {
    "service": "StackFlow API",
    "slo_target_percent": 99.9,
    "measurement_window_days": 30,
    "current_availability_percent": 99.95,
    "error_budget_remaining_minutes": 21.6,
    "error_budget_used_percent": 50
  }
}
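The arithmetic behind these fields can be sketched as follows, using the target, window, and availability from the sample above. The function names and the returned dictionary keys are hypothetical, and the sketch assumes availability is measured over the whole window; StackFlow's own calculation may differ.

```python
def error_budget(slo_target_percent, window_days, availability_percent):
    """Derive the error budget figures from an SLO target and measured availability."""
    if not 0 < slo_target_percent < 100:
        raise ValueError("slo_target_percent must be between 0 and 100 (exclusive)")

    window_minutes = window_days * 24 * 60
    # Total downtime the SLO allows over the window.
    budget_minutes = window_minutes * (100 - slo_target_percent) / 100
    # Downtime already incurred, implied by the measured availability.
    downtime_minutes = window_minutes * (100 - availability_percent) / 100

    return {
        "budget_minutes": round(budget_minutes, 2),
        "remaining_minutes": round(budget_minutes - downtime_minutes, 2),
        "used_percent": round(downtime_minutes / budget_minutes * 100, 2),
    }


def non_critical_changes_blocked(budget):
    """Mirror the rule that non-critical changes are blocked once the budget is spent."""
    return budget["remaining_minutes"] <= 0


budget = error_budget(slo_target_percent=99.9, window_days=30, availability_percent=99.95)
print(budget, "blocked:", non_critical_changes_blocked(budget))
```

A 99.9% target over 30 days allows 43.2 minutes of downtime in total, so 99.95% measured availability leaves 21.6 minutes of budget.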
Change Failure Rate
Change Failure Rate (CFR) tracks the percentage of production deployments that result in a degradation or outage requiring a hotfix or rollback. In StackFlow, a change is marked as "failed" when a P1 or P2 incident is linked to it within 24 hours of the change's implementation end time. The goal is to keep CFR below 5%, the elite performance target.
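The failure rule described above can be sketched as a small classifier. The record shapes (`id`, `implementation_end`, `change_id`, `priority`, `created`) are hypothetical, and the sketch assumes the 24-hour window is measured against the incident's creation time with no lower bound, so incidents raised during implementation also count.

```python
from datetime import datetime, timedelta

FAILURE_WINDOW = timedelta(hours=24)


def is_failed(change, incidents):
    """A change fails if a linked P1/P2 incident was created no later than 24h after it ended."""
    deadline = change["implementation_end"] + FAILURE_WINDOW
    return any(
        inc["change_id"] == change["id"]
        and inc["priority"] in {"P1", "P2"}
        and inc["created"] <= deadline
        for inc in incidents
    )


def change_failure_rate(changes, incidents):
    """Percentage of changes classified as failed; 0.0 when there are no changes."""
    if not changes:
        return 0.0
    failed = sum(is_failed(c, incidents) for c in changes)
    return failed / len(changes) * 100


end = datetime(2024, 5, 1, 12, 0)
changes = [{"id": n, "implementation_end": end} for n in (1, 2, 3, 4)]
incidents = [
    {"change_id": 1, "priority": "P1", "created": end + timedelta(hours=2)},   # counts
    {"change_id": 2, "priority": "P2", "created": end + timedelta(hours=30)},  # outside 24h
    {"change_id": 3, "priority": "P3", "created": end + timedelta(hours=1)},   # not P1/P2
]

print(f"CFR: {change_failure_rate(changes, incidents):.1f}%")
```

Only the first change is classified as failed here, so this sample sits well above the 5% target.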
Deployment Frequency
Deployment frequency is tracked via Change records tagged with the "deployment" type. For GitHub-connected teams, the GitHub Sync integration automatically creates Change records from merged PRs to the main branch, providing accurate deployment tracking without manual data entry.
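As a rough illustration, deployment frequency can be derived by counting successful deployment-type Change records over a date range. The keys `type` and `deployed_on` and the value `"successful"` are assumptions; only the `deployment_outcome` field name comes from this page.

```python
from datetime import date


def deployments_per_day(changes, start, end):
    """Average number of successful deployment changes per calendar day in [start, end]."""
    days = (end - start).days + 1
    count = sum(
        1
        for c in changes
        if c["type"] == "deployment"
        and c["deployment_outcome"] == "successful"
        and start <= c["deployed_on"] <= end
    )
    return count / days


# Hypothetical Change records, e.g. as created from merged PRs by the GitHub sync.
changes = [
    {"type": "deployment", "deployment_outcome": "successful", "deployed_on": date(2024, 5, 1)},
    {"type": "deployment", "deployment_outcome": "successful", "deployed_on": date(2024, 5, 1)},
    {"type": "deployment", "deployment_outcome": "successful", "deployed_on": date(2024, 5, 2)},
    {"type": "deployment", "deployment_outcome": "failed", "deployed_on": date(2024, 5, 2)},
    {"type": "deployment", "deployment_outcome": "successful", "deployed_on": date(2024, 5, 3)},
    {"type": "deployment", "deployment_outcome": "successful", "deployed_on": date(2024, 5, 4)},
    {"type": "deployment", "deployment_outcome": "successful", "deployed_on": date(2024, 5, 4)},
    {"type": "maintenance", "deployment_outcome": "successful", "deployed_on": date(2024, 5, 4)},
]

frequency = deployments_per_day(changes, date(2024, 5, 1), date(2024, 5, 4))
print(f"{frequency:.2f} deployments/day")
```

The failed deployment and the non-deployment change are both excluded from the count.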
Service Reliability
The service reliability section shows per-service availability percentages, calculated from the incident downtime periods linked to each service CI (configuration item). Services are ranked from least to most reliable. The 10 most impactful service outages, scored by user impact × downtime minutes, are listed for management attention.
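The ranking and the impact score can be sketched as below. The outage records, the "Customer Portal" service, the 30-day window, and the field names are hypothetical. The sketch also sums downtime per service, which would overcount if outages overlap in time.

```python
from collections import defaultdict

WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day reporting window

# Hypothetical outage periods linked to service CIs.
outages = [
    {"service": "StackFlow API", "users_impacted": 1000, "downtime_minutes": 30},
    {"service": "Customer Portal", "users_impacted": 200, "downtime_minutes": 120},
    {"service": "StackFlow API", "users_impacted": 50, "downtime_minutes": 10},
]

# Total downtime per service over the window.
downtime = defaultdict(int)
for outage in outages:
    downtime[outage["service"]] += outage["downtime_minutes"]

# Availability as the share of the window without linked downtime.
availability = {
    service: 100 * (1 - minutes / WINDOW_MINUTES)
    for service, minutes in downtime.items()
}

# Least reliable service first.
ranked_services = sorted(availability.items(), key=lambda item: item[1])


def impact_score(outage):
    """User impact multiplied by downtime minutes."""
    return outage["users_impacted"] * outage["downtime_minutes"]


# The 10 most impactful outages, highest score first.
top_outages = sorted(outages, key=impact_score, reverse=True)[:10]

for service, percent in ranked_services:
    print(f"{service}: {percent:.3f}%")
```

Note that the two orderings can disagree: the portal has the lowest availability in this sample, while the single highest-impact outage belongs to the API because it affected more users.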