SRE Metrics

DORA Metrics

The SRE Metrics dashboard tracks the four DORA (DevOps Research and Assessment) metrics, a widely used benchmark for software delivery and operational performance. StackFlow calculates them from Change Management records (deployment frequency, lead time for changes) and Incident records (MTTR, change failure rate).

⚙️ Minimum Requirements

CloudWatch: Custom namespace StackFlow/DORA with metrics: DeploymentFrequency, LeadTime, MTTR, ChangeFailureRate
GitHub Integration: StackFlowGitHubSync Lambda active for deployment frequency tracking
Aurora: stackflow.change_records with deployment_outcome field populated for change failure rate
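
As a rough illustration of the CloudWatch requirement above, the sketch below builds a datapoint payload for the StackFlow/DORA namespace. The namespace and metric names come from the list above; the units and the helper name `build_metric_data` are assumptions made for this example, not documented StackFlow behavior.

```python
from datetime import datetime, timezone

NAMESPACE = "StackFlow/DORA"

# Assumed units: the requirements list names the metrics but not their units.
METRIC_UNITS = {
    "DeploymentFrequency": "Count",
    "LeadTime": "Seconds",
    "MTTR": "Seconds",
    "ChangeFailureRate": "Percent",
}


def build_metric_data(values, timestamp=None):
    """Build a CloudWatch MetricData list from {metric_name: value}.

    Hypothetical helper: it only assembles the payload and does not call AWS.
    """
    unknown = set(values) - set(METRIC_UNITS)
    if unknown:
        raise ValueError(f"unexpected metric names: {sorted(unknown)}")
    timestamp = timestamp or datetime.now(timezone.utc)
    return [
        {
            "MetricName": name,
            "Value": float(value),
            "Unit": METRIC_UNITS[name],
            "Timestamp": timestamp,
        }
        for name, value in values.items()
    ]
```

A payload built this way could be passed to boto3's `put_metric_data(Namespace=NAMESPACE, MetricData=payload)` on a CloudWatch client, for example to check that the namespace is visible before the dashboard is enabled.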

DORA Metric	Description	StackFlow Source	Elite Target
Deployment Frequency	How often code is deployed to production	Successful Change records tagged "deployment"	Multiple/day
Lead Time for Changes	Time from commit to production	Change record create → close duration	<1 hour
MTTR	Time to restore service after incident	Incident create → resolve duration (P1+P2)	<1 hour
Change Failure Rate	% of deployments causing incidents	Changes with linked P1/P2 incidents within 24h	<5%
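
The two duration-based metrics in the table can be sketched as follows. This is a minimal example, assuming records are plain dictionaries with hypothetical keys (`created_at`, `closed_at`, `resolved_at`, `priority`) and that both metrics are reported as a simple mean; StackFlow's actual aggregation may differ.

```python
from datetime import datetime
from statistics import mean


def mean_hours(spans):
    """Mean duration in hours over (start, end) datetime pairs, or None if empty."""
    spans = list(spans)
    if not spans:
        return None
    return mean((end - start).total_seconds() for start, end in spans) / 3600


def lead_time_hours(changes):
    """Lead time: Change record create -> close, over closed changes only."""
    return mean_hours(
        (c["created_at"], c["closed_at"]) for c in changes if c.get("closed_at")
    )


def mttr_hours(incidents):
    """MTTR: Incident create -> resolve, restricted to resolved P1 and P2 incidents."""
    return mean_hours(
        (i["created_at"], i["resolved_at"])
        for i in incidents
        if i["priority"] in {"P1", "P2"} and i.get("resolved_at")
    )
```

Both results can be compared directly against the "<1 hour" elite targets in the table.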

Error Budget

The error budget is the amount of downtime a service can still accrue before it breaches its SLO (Service Level Objective) target. Configure SLOs for each critical service in SRE Metrics → SLOs. The error budget gauge shows the remaining budget both as a percentage and in absolute minutes. When the error budget reaches 0%, the change management workflow automatically blocks all non-critical changes.

{
  "slo": {
    "service": "StackFlow API",
    "slo_target_percent": 99.9,
    "measurement_window_days": 30,
    "current_availability_percent": 99.95,
    "error_budget_remaining_minutes": 21.6,
    "error_budget_used_percent": 50
  }
}
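
The arithmetic behind these fields can be sketched as below. A 99.9% target over 30 days (43,200 minutes) allows 43.2 minutes of downtime; at 99.95% availability, 21.6 of those minutes have been spent, which leaves the 21.6 minutes shown in `error_budget_remaining_minutes`. The function name and return keys are illustrative assumptions, not a StackFlow API.

```python
def error_budget(slo_target_percent, window_days, current_availability_percent):
    """Derive error budget figures from an SLO target and measured availability."""
    window_minutes = window_days * 24 * 60
    # Total downtime the SLO permits over the measurement window.
    budget_minutes = window_minutes * (100 - slo_target_percent) / 100
    # Downtime already incurred, implied by the measured availability.
    downtime_minutes = window_minutes * (100 - current_availability_percent) / 100
    remaining_minutes = budget_minutes - downtime_minutes
    used_percent = 100 * downtime_minutes / budget_minutes if budget_minutes else 100.0
    return {
        "budget_minutes": round(budget_minutes, 2),
        "remaining_minutes": round(remaining_minutes, 2),
        "used_percent": round(used_percent, 2),
        # Mirrors the rule described above: no budget left means non-critical changes stop.
        "non_critical_changes_blocked": remaining_minutes <= 0,
    }


print(error_budget(99.9, 30, 99.95))
```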

Change Failure Rate

Change Failure Rate (CFR) is the percentage of production deployments that cause a degradation or outage requiring a hotfix or rollback. In StackFlow, a change is marked as "failed" when a P1 or P2 incident is linked to it within 24 hours of the change's implementation end time. Keeping CFR below 5% corresponds to the elite performance target.
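
The 24-hour linkage rule can be sketched as follows. This example assumes the window is measured from the change's implementation end time to the incident's creation time, and that incidents created before the end time do not count; both are interpretations for illustration, and the data shapes are hypothetical.

```python
from datetime import datetime, timedelta

FAILURE_WINDOW = timedelta(hours=24)
FAILING_PRIORITIES = {"P1", "P2"}


def is_failed_change(implementation_end, linked_incidents):
    """True if any linked P1/P2 incident was created within 24h after the change ended.

    linked_incidents is a list of (priority, created_at) tuples.
    """
    deadline = implementation_end + FAILURE_WINDOW
    return any(
        priority in FAILING_PRIORITIES and implementation_end <= created_at <= deadline
        for priority, created_at in linked_incidents
    )


def change_failure_rate(changes):
    """Percentage of changes marked failed; changes is a list of (end_time, incidents)."""
    if not changes:
        return 0.0
    failed = sum(is_failed_change(end, incidents) for end, incidents in changes)
    return 100 * failed / len(changes)
```

Under these assumptions, a P3 incident or a P1 incident raised 25 hours after the change does not mark the change as failed.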

Deployment Frequency

Deployment frequency is tracked via Change records tagged with the "deployment" type. For GitHub-connected teams, the GitHub Sync integration automatically creates Change records from merged PRs to the main branch, providing accurate deployment tracking without manual data entry.
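
A minimal sketch of the counting step, assuming each Change record is a dictionary with hypothetical keys `type`, `deployment_outcome`, and `closed_on`, and that a successful outcome is stored as the string "successful":

```python
from collections import Counter
from datetime import date


def deployments_per_day(change_records):
    """Count successful "deployment" Change records per calendar day."""
    return Counter(
        rec["closed_on"]
        for rec in change_records
        if rec["type"] == "deployment" and rec["deployment_outcome"] == "successful"
    )


def average_daily_frequency(change_records, window_days):
    """Average successful deployments per day over a window of window_days days."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return sum(deployments_per_day(change_records).values()) / window_days
```

An average above 1.0 would be in line with the "Multiple/day" elite target, although the exact threshold StackFlow applies is not stated here.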

Service Reliability

The service reliability section shows per-service availability percentages, calculated from the incident downtime periods linked to each service CI. Services are ranked from least to most reliable. The 10 most impactful outages, ranked by user impact × downtime minutes, are listed separately for management attention.
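
The ranking and impact score described above can be sketched as follows. The example assumes "user impact" is a count of affected users and that availability is computed against a fixed measurement window; the function names and field names are hypothetical.

```python
def availability_percent(downtime_minutes, window_minutes):
    """Availability over the window, given total downtime in minutes."""
    return 100 * (1 - downtime_minutes / window_minutes)


def rank_services(downtime_by_service, window_minutes):
    """Return (service, availability %) pairs, least reliable first.

    downtime_by_service maps a service name to a list of downtime periods in minutes.
    """
    rows = [
        (name, availability_percent(sum(periods), window_minutes))
        for name, periods in downtime_by_service.items()
    ]
    return sorted(rows, key=lambda row: row[1])


def top_outages(outages, limit=10):
    """Return the most impactful outages by users_impacted x downtime_minutes."""
    return sorted(
        outages,
        key=lambda o: o["users_impacted"] * o["downtime_minutes"],
        reverse=True,
    )[:limit]
```

Note that a short outage affecting many users can outrank a long outage affecting few, which is the point of weighting downtime by user impact.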