architecture

routing layer

The TOVA routing engine selects an inference provider for every request by scoring each eligible provider against price, performance, and reliability signals.

what the router evaluates

price per input token and per output token
provider uptime over rolling windows
latency to first token and tokens per second
throughput and concurrent capacity
available model inventory and equivalence classes
recent request success rate

default routing factor weights

Figure (radar chart): default policy weights. Per-request overrides reshape the radar at call time.

scoring model

Each eligible provider receives a composite score. Every term in the score is a cost or a penalty, so lower is better. The router selects the lowest-scoring provider among those that satisfy the request's hard constraints (model, max latency, max cost, region).

provider_score =
    price_weight    * normalized_price
  + latency_weight  * normalized_latency
  + error_weight    * recent_error_rate
  + capacity_weight * capacity_penalty
  + uptime_weight   * uptime_penalty

Weights are policy-driven. If the request prioritizes cost, the price weight increases. If the request prioritizes speed, latency and throughput weights increase. Hard constraints filter the provider set before scoring; weights only determine the winner among providers that already qualify.
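The filter-then-score flow described above can be sketched in a few lines of Python. This is an illustration under assumptions, not TOVA's implementation: the Provider and Constraints types, their field names, and the weight dictionary keys are hypothetical, and every normalized signal is assumed to lie in the range 0 to 1 with higher meaning worse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provider:
    # Hypothetical record; field names mirror the terms of provider_score.
    name: str
    model: str
    region: str
    normalized_price: float
    normalized_latency: float
    recent_error_rate: float
    capacity_penalty: float
    uptime_penalty: float
    latency_ms: float
    cost_per_million_tokens: float


@dataclass
class Constraints:
    # Hard constraints named in the text: model, max latency, max cost, region.
    model: str
    max_latency_ms: Optional[float] = None
    max_cost_per_million_tokens: Optional[float] = None
    region: Optional[str] = None


def satisfies(p: Provider, c: Constraints) -> bool:
    """Return True if the provider passes every hard constraint that is set."""
    if p.model != c.model:
        return False
    if c.max_latency_ms is not None and p.latency_ms > c.max_latency_ms:
        return False
    if (c.max_cost_per_million_tokens is not None
            and p.cost_per_million_tokens > c.max_cost_per_million_tokens):
        return False
    if c.region is not None and p.region != c.region:
        return False
    return True


def provider_score(p: Provider, w: dict) -> float:
    """Weighted sum of penalties; a lower score is better."""
    return (w["price"] * p.normalized_price
            + w["latency"] * p.normalized_latency
            + w["error"] * p.recent_error_rate
            + w["capacity"] * p.capacity_penalty
            + w["uptime"] * p.uptime_penalty)


def select_provider(providers, c: Constraints, w: dict) -> Optional[Provider]:
    """Filter by hard constraints first, then pick the lowest score."""
    eligible = [p for p in providers if satisfies(p, c)]
    if not eligible:
        return None
    return min(eligible, key=lambda p: provider_score(p, w))
```

Because filtering happens before scoring, a provider with an excellent score is never chosen if it violates a hard constraint, which matches the ordering the text describes.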

route policy examples

{
  "route": {
    "objective": "cheapest",
    "max_latency_ms": 1200,
    "fallback": true
  }
}

{
  "route": {
    "objective": "fastest",
    "max_cost_per_million_tokens": 2.50
  }
}
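One way to picture how an objective could reshape the weights is the Python sketch below. Everything numeric here is an assumption made for illustration: the equal default weights, the boost multipliers, and the choice to map the throughput emphasis of "fastest" onto the capacity weight are hypothetical, since the source does not publish the actual values.

```python
import json

# Hypothetical defaults; the real default policy weights are not given in the text.
DEFAULT_WEIGHTS = {
    "price": 0.2, "latency": 0.2, "error": 0.2, "capacity": 0.2, "uptime": 0.2,
}

# Hypothetical multipliers applied per objective before renormalizing.
OBJECTIVE_BOOST = {
    "cheapest": {"price": 3.0},
    "fastest": {"latency": 3.0, "capacity": 2.0},
    "balanced": {},
}

# Keys treated as hard constraints rather than weight adjustments.
CONSTRAINT_KEYS = ("max_latency_ms", "max_cost_per_million_tokens")


def weights_for(route: dict) -> dict:
    """Boost the weights tied to the objective, then rescale so they sum to 1."""
    boosts = OBJECTIVE_BOOST[route.get("objective", "balanced")]
    raw = {k: v * boosts.get(k, 1.0) for k, v in DEFAULT_WEIGHTS.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}


def hard_constraints(route: dict) -> dict:
    """Extract the constraint fields that filter providers before scoring."""
    return {k: route[k] for k in CONSTRAINT_KEYS if k in route}


policy = json.loads(
    '{"route": {"objective": "cheapest", "max_latency_ms": 1200, "fallback": true}}'
)
route = policy["route"]
weights = weights_for(route)        # price weight is now the largest
limits = hard_constraints(route)    # {"max_latency_ms": 1200}
```

The split mirrors the two roles a policy plays: the objective only tilts the weights, while fields such as max_latency_ms remove providers from consideration entirely.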

fallback and failover

If the selected provider fails, times out, or returns a rate-limit error before streaming begins, TOVA can retry the request against the next eligible provider. Developers configure fallback behavior per request.

fallback parameters

fallback — enable or disable automatic retry
max_retries — cap on the number of alternate providers attempted
timeout_ms — request timeout before failover is triggered
the next-best-provider strategy reuses the same scoring model with the failing provider removed
once output streaming has started, the request cannot be safely failed over, so the partial stream is surfaced to the caller

{
  "route": {
    "objective": "balanced",
    "fallback": true,
    "max_retries": 2,
    "timeout_ms": 15000
  }
}
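The retry behavior configured above might look roughly like the following Python sketch. It rests on assumptions: route_with_fallback, ProviderError, and the send callback are hypothetical names, the ranked list is assumed to be already sorted by score (best first), and ProviderError stands for any failure, timeout, or rate-limit error raised before output begins. With static scores, walking the ranked list is equivalent to re-scoring with the failed provider removed.

```python
from typing import Any, Callable, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error: failure, timeout, or rate limit before streaming starts."""


def route_with_fallback(
    ranked: Sequence[str],
    send: Callable[[str, int], Any],
    fallback: bool = True,
    max_retries: int = 2,
    timeout_ms: int = 15000,
) -> Tuple[str, Any]:
    """Try providers in score order; return (provider, result) on first success."""
    # One initial attempt, plus up to max_retries alternates when fallback is on.
    attempts_allowed = 1 + (max_retries if fallback else 0)
    last_error = None
    for provider in ranked[:attempts_allowed]:
        try:
            # send is expected to raise ProviderError only before streaming begins.
            # Once it returns, the stream has started and no failover is attempted.
            return provider, send(provider, timeout_ms)
        except ProviderError as exc:
            last_error = exc  # remember the failure and move to the next provider
    raise last_error or ProviderError("no eligible provider")
```

Errors that occur after streaming has begun are deliberately outside this loop, consistent with the rule that a partially streamed request is not failed over.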

evolution of the layer

The routing layer is designed to extend over time toward:

permissioned third-party inference suppliers
regional routing and SLA tiers
capacity marketplace dynamics for surplus throughput
dynamic pricing optimization across providers
agent-driven inference budgets and task-level routing

info

The router is policy-driven. Operators can publish custom routing policies — for example EU-only providers, no-train clauses, or strict latency bounds — and TOVA enforces them across every call.
