v2026.1

SRE Metrics

DORA Metrics

The SRE Metrics dashboard tracks the four DORA (DevOps Research and Assessment) metrics, a widely used standard for measuring software delivery and operational performance. StackFlow calculates deployment frequency and lead time from Change Management records, and MTTR and change failure rate from Incident records.

⚙️ Minimum Requirements
  • CloudWatch: Custom namespace StackFlow/DORA with metrics: DeploymentFrequency, LeadTime, MTTR, ChangeFailureRate
  • GitHub Integration: StackFlowGitHubSync Lambda active for deployment frequency tracking
  • Aurora: stackflow.change_records with deployment_outcome field populated for change failure rate
| DORA Metric | Description | StackFlow Source | Elite Target |
|---|---|---|---|
| Deployment Frequency | How often code is deployed to production | Successful Change records tagged "deployment" | Multiple/day |
| Lead Time for Changes | Time from commit to production | Change record create → close duration | <1 hour |
| MTTR | Time to restore service after incident | Incident create → resolve duration (P1+P2) | <1 hour |
| Change Failure Rate | % of deployments causing incidents | Changes with linked P1/P2 incidents within 24h | <5% |
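
The two duration-based metrics in the table can be sketched as follows. This is a minimal illustration, not StackFlow's implementation: the record field names (`created_at`, `closed_at`, `resolved_at`, `priority`) are hypothetical, and the choice of a median for lead time and a mean for MTTR is an assumption, since the aggregation method is not specified here.

```python
from datetime import datetime
from statistics import mean, median


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


def lead_time_hours(change_records: list[dict]) -> float:
    """Median create-to-close duration across Change records (assumed aggregation)."""
    return median(
        hours_between(c["created_at"], c["closed_at"]) for c in change_records
    )


def mttr_hours(incidents: list[dict]) -> float:
    """Mean create-to-resolve duration, counting only P1 and P2 incidents."""
    durations = [
        hours_between(i["created_at"], i["resolved_at"])
        for i in incidents
        if i["priority"] in ("P1", "P2")
    ]
    # statistics.mean raises StatisticsError if there are no P1/P2 incidents.
    return mean(durations)
```

Either result can then be compared against the elite target of under 1 hour.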

Error Budget

The error budget tracks how much downtime remains before a service breaches its SLO (Service Level Objective) target. Configure SLOs for each critical service in SRE Metrics → SLOs. The error budget gauge shows the remaining budget both as a percentage and in absolute minutes. When the error budget reaches 0%, the change management workflow automatically blocks all non-critical changes.

{
  "slo": {
    "service": "StackFlow API",
    "slo_target_percent": 99.9,
    "measurement_window_days": 30,
    "current_availability_percent": 99.95,
    "error_budget_remaining_minutes": 21.6,
    "error_budget_used_percent": 50
  }
}
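
The arithmetic behind these fields can be sketched as below. This is an illustrative calculation under stated assumptions, not StackFlow's own code: the function name and the `error_budget_total_minutes` and `non_critical_changes_blocked` keys are hypothetical, and downtime is derived from the availability percentage rather than from incident records.

```python
def error_budget(
    slo_target_percent: float,
    window_days: int,
    availability_percent: float,
) -> dict:
    """Derive error budget figures from an SLO target and measured availability."""
    if not 0 < slo_target_percent < 100:
        raise ValueError("slo_target_percent must be between 0 and 100 (exclusive)")

    window_minutes = window_days * 24 * 60
    # Total allowed downtime in the window, e.g. 0.1% of 43,200 minutes.
    budget_minutes = window_minutes * (100 - slo_target_percent) / 100
    # Downtime already incurred, inferred from measured availability.
    downtime_minutes = window_minutes * (100 - availability_percent) / 100

    remaining_minutes = max(0.0, budget_minutes - downtime_minutes)
    used_percent = min(100.0, 100 * downtime_minutes / budget_minutes)

    return {
        "error_budget_total_minutes": round(budget_minutes, 2),
        "error_budget_remaining_minutes": round(remaining_minutes, 2),
        "error_budget_used_percent": round(used_percent, 2),
        # Mirrors the rule that non-critical changes are blocked at 0% budget.
        "non_critical_changes_blocked": remaining_minutes <= 0,
    }
```

For a 99.9% target over 30 days, the total budget is 43.2 minutes. At 99.95% measured availability, 21.6 minutes have been consumed, so 21.6 minutes remain.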

Change Failure Rate

Change Failure Rate (CFR) tracks the percentage of production deployments that result in a degradation or outage requiring a hotfix or rollback. In StackFlow, a change is marked as "failed" when a P1 or P2 incident is linked to it within 24 hours of the change implementation end time. Elite performance means keeping CFR below 5%.
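
The failure rule can be sketched as below. This is a minimal illustration with hypothetical field names (`implementation_end`, `linked_incidents`, `linked_at`). It assumes the 24-hour window starts at the implementation end time and is measured against the time the incident was linked; the actual workflow may treat incidents raised during implementation differently.

```python
from datetime import datetime, timedelta

FAILURE_WINDOW = timedelta(hours=24)
FAILING_PRIORITIES = {"P1", "P2"}


def is_failed_change(change: dict) -> bool:
    """True if a P1/P2 incident was linked within 24h after implementation end."""
    end = change["implementation_end"]
    return any(
        incident["priority"] in FAILING_PRIORITIES
        and end <= incident["linked_at"] <= end + FAILURE_WINDOW
        for incident in change["linked_incidents"]
    )


def change_failure_rate(changes: list[dict]) -> float:
    """Percentage of changes classified as failed; 0.0 when there are no changes."""
    if not changes:
        return 0.0
    failed = sum(is_failed_change(c) for c in changes)
    return 100 * failed / len(changes)
```

With this rule, one failed change out of 20 gives a CFR of 5%, which sits exactly at the elite threshold rather than below it.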

Deployment Frequency

Deployment frequency is tracked via Change records tagged with the "deployment" type. For GitHub-connected teams, the GitHub Sync integration automatically creates Change records from merged PRs to the main branch, providing accurate deployment tracking without manual data entry.
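
As a rough sketch of how the frequency could be derived from those records, the function below counts successful deployment-type changes per day over a date window. The field names `type` and `implemented_on` and the `"success"` outcome value are assumptions for illustration. Only the `deployment_outcome` field name comes from the requirements above.

```python
from datetime import date


def deployments_per_day(
    change_records: list[dict],
    window_start: date,
    window_end: date,
) -> float:
    """Average number of successful deployments per day, window bounds inclusive."""
    window_days = (window_end - window_start).days + 1
    deployments = sum(
        1
        for record in change_records
        if record["type"] == "deployment"
        and record["deployment_outcome"] == "success"  # assumed outcome value
        and window_start <= record["implemented_on"] <= window_end
    )
    return deployments / window_days
```

A result above 1.0 corresponds to the "Multiple/day" elite target.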

Service Reliability

The service reliability section shows per-service availability percentages, calculated from the incident downtime periods linked to each service CI (configuration item). Services are ranked from least to most reliable. The 10 most impactful service outages, scored by user impact × downtime minutes, are listed for management attention.
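
The ranking and the impact score can be sketched as follows. This is an illustrative outline only: the function names and the `users_impacted` and `downtime_minutes` fields are hypothetical, and user impact is assumed to be a count of affected users.

```python
def availability_percent(downtime_minutes: float, window_minutes: float) -> float:
    """Share of the measurement window during which the service was available."""
    return 100 * (1 - downtime_minutes / window_minutes)


def rank_services(downtime_by_service: dict[str, float], window_minutes: float) -> list[tuple[str, float]]:
    """Return (service, availability %) pairs, least reliable first."""
    ranked = [
        (service, availability_percent(minutes, window_minutes))
        for service, minutes in downtime_by_service.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1])


def top_outages(outages: list[dict], limit: int = 10) -> list[dict]:
    """Return the outages with the highest user impact x downtime score."""
    return sorted(
        outages,
        key=lambda o: o["users_impacted"] * o["downtime_minutes"],
        reverse=True,
    )[:limit]
```

Scoring by the product means a short outage that affects many users can outrank a long outage that affects few.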