Model Router

Routing Architecture

The StackFlow Model Router is a middleware layer between the application and AI providers that selects a model for each request. Routing decisions consider task type, required quality level, remaining cost budget, provider health, and current latency. The router runs synchronously within the StackFlowAPI Lambda before each Bedrock API call.

⚙️ Minimum Requirements

DynamoDB: StackFlow_AIModelRouter table with routing rules; PK routerId, GSI on intentType
DynamoDB: StackFlow_AIProvider records referenced by router rules must be status: active
Redis: Router decisions cached under sf:router:{intentHash} with TTL 600s
Lambda Env Var: MODEL_ROUTER_TABLE=StackFlow_AIModelRouter set in StackFlowAPI
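
The cache key format above can be illustrated with a minimal sketch. The key prefix and the 600-second TTL come from the requirements; how intentHash is actually computed is not documented here, so the SHA-256 over canonical JSON below is an assumption, as are the function name router_cache_key and the sample source_module value.

```python
import hashlib
import json

ROUTER_CACHE_PREFIX = "sf:router:"
ROUTER_CACHE_TTL_SECONDS = 600  # TTL stated in the requirements


def router_cache_key(intent: dict) -> str:
    """Build the Redis key for a cached routing decision.

    Assumption: intentHash is a SHA-256 digest of the request's routing
    metadata serialized as canonical JSON. The real scheme may differ.
    """
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(intent, sort_keys=True, separators=(",", ":"))
    intent_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{ROUTER_CACHE_PREFIX}{intent_hash}"


# Hypothetical request metadata, for illustration only.
key = router_cache_key({"task_type": "classification", "source_module": "tickets"})
```

With the redis-py client, a decision could then be stored with something like client.setex(key, ROUTER_CACHE_TTL_SECONDS, payload), so entries expire on their own after ten minutes.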

Routing Rules

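Routing rules are evaluated in ascending priority order, so the rule with priority 1 is checked first. The first matching rule determines the model selection. Rules can match on request metadata (task_type, priority, source_module), context (user role, tenant plan), and system state (budget usage, provider health). A rule with an empty condition matches every request, which is how the Default rule at priority 999 acts as a catch-all.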

{
  "routing_rules": [
    {
      "priority": 1,
      "name": "P1 Incident Triage - High Quality",
      "condition": {"task_type": "incident_triage", "incident_priority": "P1"},
      "model": "claude-3-sonnet",
      "provider": "bedrock",
      "max_tokens": 4096
    },
    {
      "priority": 2,
      "name": "Classification Tasks - Fast & Cheap",
      "condition": {"task_type": "classification"},
      "model": "claude-3-haiku",
      "provider": "bedrock",
      "max_tokens": 512
    },
    {
      "priority": 3,
      "name": "Complex Analysis - Opus",
      "condition": {"task_type": "rca_analysis", "complexity": "high"},
      "model": "claude-3-opus",
      "provider": "bedrock",
      "max_tokens": 8192
    },
    {
      "priority": 999,
      "name": "Default",
      "condition": {},
      "model": "claude-3-haiku",
      "provider": "bedrock",
      "max_tokens": 2048
    }
  ]
}
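
The first-match evaluation described above can be sketched as follows. This is an illustration only, not the router's actual implementation: the function name select_rule is hypothetical, and it assumes a condition matches when every key in it equals the corresponding field of the request.

```python
from typing import Any, Optional

# Rules copied from the example configuration (name and provider omitted).
ROUTING_RULES = [
    {"priority": 999, "condition": {}, "model": "claude-3-haiku", "max_tokens": 2048},
    {"priority": 1,
     "condition": {"task_type": "incident_triage", "incident_priority": "P1"},
     "model": "claude-3-sonnet", "max_tokens": 4096},
    {"priority": 2, "condition": {"task_type": "classification"},
     "model": "claude-3-haiku", "max_tokens": 512},
    {"priority": 3, "condition": {"task_type": "rca_analysis", "complexity": "high"},
     "model": "claude-3-opus", "max_tokens": 8192},
]


def select_rule(rules: list[dict[str, Any]],
                request: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the first rule, in ascending priority, whose condition matches."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        # An empty condition has nothing to check, so all() is True.
        if all(request.get(k) == v for k, v in rule["condition"].items()):
            return rule
    return None


p1 = select_rule(ROUTING_RULES,
                 {"task_type": "incident_triage", "incident_priority": "P1"})
fallback = select_rule(ROUTING_RULES, {"task_type": "rca_analysis"})
```

In this sketch, an rca_analysis request without complexity set to high falls through to the Default rule, because the priority 3 condition requires both fields to match.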

Cost Optimization

The model router tracks daily and monthly token spend per tenant. As daily spend approaches 80% of the daily budget, the router automatically downgrades requests to cheaper models (Haiku instead of Sonnet) and logs each downgrade for review. At 95% of the daily budget, only critical requests (P1 triage, active copilot sessions) are processed; non-critical tasks are queued.
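
The two budget thresholds can be expressed as a small decision function. This is a sketch under stated assumptions: the name apply_budget_policy and the action labels are hypothetical, the thresholds are treated as inclusive, only the Sonnet to Haiku downgrade named above is mapped, and critical requests are assumed to keep their routed model (the text does not say whether they are also downgraded).

```python
DOWNGRADE_THRESHOLD = 0.80      # start downgrading at 80% of the daily budget
CRITICAL_ONLY_THRESHOLD = 0.95  # queue non-critical work at 95%

# Only the downgrade named in the text; other mappings are unspecified.
DOWNGRADES = {"claude-3-sonnet": "claude-3-haiku"}


def apply_budget_policy(model: str, daily_spend: float,
                        daily_budget: float, critical: bool) -> tuple[str, str]:
    """Return (action, model), where action is process, downgrade, or queue."""
    if daily_budget <= 0:
        raise ValueError("daily_budget must be positive")
    usage = daily_spend / daily_budget
    if critical:
        # Assumption: critical requests are processed on their routed model.
        return ("process", model)
    if usage >= CRITICAL_ONLY_THRESHOLD:
        return ("queue", model)
    if usage >= DOWNGRADE_THRESHOLD:
        cheaper = DOWNGRADES.get(model, model)
        # Report a downgrade only when the model actually changed.
        return ("downgrade" if cheaper != model else "process", cheaper)
    return ("process", model)


decision = apply_budget_policy("claude-3-sonnet", daily_spend=85.0,
                               daily_budget=100.0, critical=False)
```

A real router would also write the downgrade to a log for later review, as described above; that step is left out of the sketch.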

Tip: Use the Semantic Cache to reduce model router invocations. Cache hits bypass the router entirely, significantly reducing both latency and cost for frequently asked questions.

Latency vs Quality Tradeoffs

Different use cases have different latency tolerances. The AI Copilot requires low latency (under 2 seconds to first token) to maintain a conversational feel. Background tasks like article generation can tolerate 10-30 seconds. The router uses streaming for interactive use cases and batch processing for background tasks, and selects models accordingly.

Router Metrics

Router metrics are visible in AI → Observability → Model Router. Key metrics include: requests per model, cache hit rate, downgrade frequency, fallback activation count, and cost per task type. These metrics help optimize routing rules and identify opportunities for cost reduction.
