my-group/production-chat). Behind that name, you configure one routing strategy and one or more real target models (for example azure/gpt-4o and openai/gpt-4o). The gateway handles load balancing, health-aware routing, retries, and fallbacks automatically so you do not hard-code provider details in every service.
Why route across multiple targets?
Production LLM traffic benefits when the gateway can choose or fail over among more than one backend. Common drivers:
- Service outages and downtime
- Latency variance among models
- Rate limits of models
- Canary testing
Why virtual models?
- Stable API surface — Your apps pass one model identifier; you change targets, weights, or providers in the gateway without redeploying clients.
- Resilience — Retries and fallback status codes route around rate limits and transient errors across targets.
- Governance — Virtual model provider groups support collaborator roles so teams can use or manage routing separately.
Routing strategies
When you create a virtual model, you choose one of three routing strategies. Each strategy uses the same list of targets; the difference is how the gateway picks among healthy targets for each request.

| Strategy | How it works | Best for |
|---|---|---|
| Weight-based | Distributes traffic by assigned weights (e.g. 80/20). Also supports sticky routing to pin sessions to a target. | Canary rollouts, fixed capacity splits, A/B allocation |
| Priority-based | Routes to the highest-priority healthy target (0 = highest). Falls back to the next on failure. Supports SLA cutoff. | Primary + backup topologies, cost optimization |
| Latency-based | Automatically routes to the target with the lowest recent latency. No weights needed. | Performance chasing across regions or providers |
Weight-based routing
You assign a weight to each target, and the gateway distributes incoming requests in proportion to those weights: for example, 90% to azure/gpt-4o and 10% to openai/gpt-4o.
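As a rough illustration of proportional distribution, the sketch below picks a target at random with probability proportional to its weight. It is a simulation of the idea, not the gateway's implementation; the `target` and `weight` dictionary keys and the `pick_weighted` helper are names assumed for this example.

```python
import random
from collections import Counter

def pick_weighted(targets, rng=random):
    """Pick one target name with probability proportional to its weight."""
    names = [t["target"] for t in targets]      # hypothetical key names
    weights = [t["weight"] for t in targets]
    return rng.choices(names, weights=weights, k=1)[0]

targets = [
    {"target": "azure/gpt-4o", "weight": 90},
    {"target": "openai/gpt-4o", "weight": 10},
]

rng = random.Random(0)  # seeded only so the demo is repeatable
counts = Counter(pick_weighted(targets, rng) for _ in range(10_000))
print(counts)  # roughly a 90/10 split between the two targets
```

Because selection is random per request, the observed split only approaches 90/10 over many requests; short bursts can deviate noticeably.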
Sticky routing (weight-based only)
Sticky routing pins requests that share the same session key to the same target model for a configurable time window (ttl_seconds). Within that window, every request carrying the same session identifier is routed to the same model. When the window expires, the session is re-evaluated and may land on a different model. This is useful for multi-turn conversations, prompt cache efficiency, and a consistent user experience.
Configuring sticky routing
Add a sticky_routing block inside your weight-based routing configuration. Two fields are required: ttl_seconds and at least one entry in session_identifiers.
Each entry in session_identifiers has two fields:
- key: The header or metadata field name to read.
- source: headers to read from HTTP request headers, or metadata to read from request metadata.
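The pinning behaviour can be sketched as a small in-memory simulation. This is not gateway code: the `StickyRouter` class, the `x-session-id` header name, the way multiple identifiers are joined into one session key, and the choice to fall back to ordinary weighted routing when an identifier is missing are all assumptions made for illustration. Only `ttl_seconds`, `session_identifiers`, `key`, and `source` come from the configuration described here.

```python
import random
import time

class StickyRouter:
    """Toy model of sticky routing: pin a session key to a target for ttl_seconds."""

    def __init__(self, targets, ttl_seconds, session_identifiers, rng=None, clock=time.monotonic):
        self.targets = targets
        self.ttl_seconds = ttl_seconds
        self.session_identifiers = session_identifiers
        self.rng = rng or random.Random()
        self.clock = clock
        self._pins = {}  # session key -> (target name, expiry timestamp)

    def _session_key(self, headers, metadata):
        parts = []
        for ident in self.session_identifiers:
            source = headers if ident["source"] == "headers" else metadata
            value = source.get(ident["key"])
            if value is None:
                return None  # assumption: no identifier means no pinning
            parts.append(f'{ident["source"]}:{ident["key"]}={value}')
        return "|".join(parts)

    def route(self, headers=None, metadata=None):
        key = self._session_key(headers or {}, metadata or {})
        now = self.clock()
        if key is not None:
            pinned = self._pins.get(key)
            if pinned and pinned[1] > now:
                return pinned[0]  # still inside the TTL window
        names = [t["target"] for t in self.targets]
        weights = [t["weight"] for t in self.targets]
        choice = self.rng.choices(names, weights=weights, k=1)[0]
        if key is not None:
            self._pins[key] = (choice, now + self.ttl_seconds)
        return choice

now = [0.0]  # fake clock so the demo does not need to sleep
router = StickyRouter(
    targets=[{"target": "azure/gpt-4o", "weight": 50},
             {"target": "openai/gpt-4o", "weight": 50}],
    ttl_seconds=3600,
    session_identifiers=[{"key": "x-session-id", "source": "headers"}],
    rng=random.Random(1),
    clock=lambda: now[0],
)
first = router.route(headers={"x-session-id": "abc"})
now[0] = 1800.0  # half an hour later, same session
same = router.route(headers={"x-session-id": "abc"})
print(first == same)  # True: the session is still pinned
```

A real deployment would need shared state so that every gateway instance sees the same pins; the dictionary here only stands in for that store.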
ttl_seconds defines how long a session stays pinned. For chatbots, 3600 (1 hour) is a common starting point. For longer workflows, consider 86400 (24 hours).
Fallback during a sticky session: If the pinned model fails, the remaining healthy targets are tried in sequence (this fallback order is not weight-based). Only targets with fallback_candidate: true are eligible. After a successful fallback, subsequent requests for that session in the same TTL window are routed to the working target, not back to the one that failed.
Priority-based routing
Each target has a priority number. The gateway routes to the healthy target with the highest priority (0 is highest). If that target fails or is unavailable, the gateway falls back to the next target in priority order.
SLA cutoff
- Configure a Time Per Output Token (TPOT) threshold per target using sla_cutoff.time_per_output_token_ms
- The gateway monitors average TPOT over a 3-minute rolling window (up to 10 samples, minimum 3 required)
- If TPOT exceeds the threshold, the target is marked unhealthy and moved to the end of the list
- Recovery is automatic when metrics improve or older data ages out
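The following sketch simulates priority ordering combined with the SLA cutoff rules listed above: a 3-minute window, at most 10 samples, and no verdict until 3 samples exist. It is an illustration under assumptions rather than the gateway's code; `TpotTracker`, `order_targets`, and the dictionary layout of each target are hypothetical, with only the `sla_cutoff.time_per_output_token_ms` field name taken from the configuration.

```python
from collections import deque

WINDOW_SECONDS = 180  # 3-minute rolling window
MAX_SAMPLES = 10
MIN_SAMPLES = 3

class TpotTracker:
    """Keeps the most recent TPOT samples for one target."""

    def __init__(self):
        self.samples = deque(maxlen=MAX_SAMPLES)  # (timestamp, tpot_ms)

    def record(self, timestamp, tpot_ms):
        self.samples.append((timestamp, tpot_ms))

    def average(self, now):
        recent = [v for ts, v in self.samples if now - ts <= WINDOW_SECONDS]
        if len(recent) < MIN_SAMPLES:
            return None  # not enough data to judge the target
        return sum(recent) / len(recent)

def order_targets(targets, trackers, now):
    """Sort by priority (0 first), then move SLA-breaching targets to the end."""
    healthy, unhealthy = [], []
    for t in sorted(targets, key=lambda t: t["priority"]):
        avg = trackers[t["target"]].average(now)
        cutoff = t.get("sla_cutoff", {}).get("time_per_output_token_ms")
        breached = cutoff is not None and avg is not None and avg > cutoff
        (unhealthy if breached else healthy).append(t["target"])
    return healthy + unhealthy

targets = [
    {"target": "azure/gpt-4o", "priority": 0,
     "sla_cutoff": {"time_per_output_token_ms": 50}},
    {"target": "openai/gpt-4o", "priority": 1},
]
trackers = {t["target"]: TpotTracker() for t in targets}
for ts in (0, 10, 20):
    trackers["azure/gpt-4o"].record(ts, 80.0)  # slower than the 50 ms cutoff

print(order_targets(targets, trackers, now=30))   # backup first while the primary breaches
print(order_targets(targets, trackers, now=300))  # primary first again once samples age out
```

The second call shows the automatic recovery: once the slow samples fall outside the window, there is too little data to call the primary unhealthy, so it returns to the front.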
Latency-based routing
You do not set weights. The gateway chooses the target with the lowest recent latency automatically.
How latency is measured
- The time per output token (inter-token latency) is used as the latency metric.
- Only requests in the last 20 minutes are considered. If more than 100 requests exist in that window, the last 100 are used. If fewer than 3 requests exist, the model is treated as the fastest so that more traffic is routed to it to gather data.
- Models are considered equally fast if their latency is within 1.2× of the fastest — this avoids rapid switching due to minor differences.
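The measurement rules above can be sketched as a small selection function. This is a simulation, not the gateway's implementation: it assumes the per-target metric is a simple average of the retained samples, and the function names, the sample layout, and the `example/slow-model` target are hypothetical.

```python
WINDOW_SECONDS = 20 * 60  # only the last 20 minutes count
MAX_REQUESTS = 100        # at most the last 100 requests in the window
MIN_REQUESTS = 3          # fewer than this: treat the target as fastest
TOLERANCE = 1.2           # within 1.2x of the best counts as equally fast

def recent_latency(samples, now):
    """samples: list of (timestamp, tpot_ms), oldest first. None if too little data."""
    recent = [v for ts, v in samples if now - ts <= WINDOW_SECONDS][-MAX_REQUESTS:]
    if len(recent) < MIN_REQUESTS:
        return None
    return sum(recent) / len(recent)  # assumption: plain average

def fastest_candidates(history, now):
    """Return the targets that should be considered fastest right now."""
    latencies = {name: recent_latency(s, now) for name, s in history.items()}
    unknown = [name for name, v in latencies.items() if v is None]
    if unknown:
        return unknown  # under-sampled targets get traffic so data can be gathered
    best = min(latencies.values())
    return [name for name, v in latencies.items() if v <= best * TOLERANCE]

history = {
    "azure/gpt-4o": [(0, 40.0), (1, 40.0), (2, 40.0)],
    "openai/gpt-4o": [(0, 45.0), (1, 45.0), (2, 45.0)],
    "example/slow-model": [(0, 90.0), (1, 90.0), (2, 90.0)],
}
candidates = fastest_candidates(history, now=10)
print(candidates)  # ['azure/gpt-4o', 'openai/gpt-4o']
```

Here 45 ms is within 1.2× of 40 ms (the limit is 48 ms), so both targets tie and traffic does not flip between them over a 5 ms difference. How the gateway chooses among tied targets is not specified here, so the sketch stops at returning the candidate set.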
Per-target configuration
Regardless of which routing strategy you choose, each target in the virtual model supports several options that control what happens when a request is routed to that target.
Retries and fallbacks
Each target can define how the gateway should handle failures before giving up or moving to another target:
- Retry configuration: Number of attempts, delay between retries, and which status codes trigger a retry on the same target. Defaults: 2 attempts, 100 ms delay, retry on 429, 500, 502, 503.
- Fallback status codes: Which status codes cause the gateway to stop retrying this target and try a different target instead. Default: 401, 403, 404, 429, 500, 502, 503.
- Fallback candidate: Whether this target is eligible to receive traffic when another target fails. Default: true.
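One plausible reading of how these three settings interact is sketched below. It rests on several assumptions that the text does not settle: "attempts" is taken to mean total tries per target, retries on the same target are exhausted before the fallback check, and the `retry`, `attempts`, `delay_ms`, `on_status_codes`, and `fallback_status_codes` keys are invented for the example. Only `fallback_candidate` and the default values come from the list above.

```python
import time

DEFAULT_RETRY = {"attempts": 2, "delay_ms": 100, "on_status_codes": [429, 500, 502, 503]}
DEFAULT_FALLBACK_CODES = [401, 403, 404, 429, 500, 502, 503]

def call_with_fallback(targets, send, sleep=time.sleep):
    """Try targets in order; `send(name)` returns an HTTP status code.

    Returns (target name, status) for the last attempt made.
    """
    last = None
    for i, target in enumerate(targets):
        # Targets after the first only receive traffic if they are fallback candidates.
        if i > 0 and not target.get("fallback_candidate", True):
            continue
        retry = target.get("retry", DEFAULT_RETRY)
        fallback_codes = target.get("fallback_status_codes", DEFAULT_FALLBACK_CODES)
        attempts = max(1, retry["attempts"])
        for attempt in range(attempts):
            status = send(target["target"])
            last = (target["target"], status)
            if status < 400:
                return last  # success
            if status not in retry["on_status_codes"] or attempt == attempts - 1:
                break  # not retryable, or retries exhausted
            sleep(retry["delay_ms"] / 1000)
        if status not in fallback_codes:
            return last  # error is not eligible for fallback
    return last

calls = []
def fake_send(name):
    calls.append(name)
    return 503 if name == "azure/gpt-4o" else 200

targets = [{"target": "azure/gpt-4o"}, {"target": "openai/gpt-4o"}]
result = call_with_fallback(targets, fake_send, sleep=lambda s: None)
print(result, calls)
```

In the demo the first target returns 503 twice (one try plus one retry), after which the request falls back to the second target and succeeds.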
Header overrides
You can inject or remove HTTP headers on a per-target basis, applied just before the request is sent to that model. This is useful when a specific target requires headers that the others don’t, for example a region identifier, a deployment ID, or an API version header expected by one provider but not the rest.
Configuration and examples
Add a headers_override block to any target in your load_balance_targets list:
- set: Key-value pairs of headers to add or overwrite on the outgoing request.
- remove: List of header keys to strip from the outgoing request.
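The effect of `set` and `remove` on an outgoing request can be pictured with a short function. This is a sketch, not the gateway's code: `apply_headers_override` is a hypothetical helper, the header names are made up, and applying `remove` before `set` is an assumption about ordering. Keys are lowercased so that differently cased spellings of one header name are treated as the same header.

```python
def apply_headers_override(headers, override):
    """Return a new header dict with the override's remove/set rules applied."""
    out = {k.lower(): v for k, v in headers.items()}
    for key in override.get("remove", []):
        out.pop(key.lower(), None)   # missing keys are ignored
    for key, value in override.get("set", {}).items():
        out[key.lower()] = value     # add or overwrite
    return out

headers = {"Authorization": "Bearer <token>", "X-Debug": "1"}
override = {
    "set": {"X-Region": "eu-west"},  # hypothetical header required by one target
    "remove": ["x-debug"],
}
result = apply_headers_override(headers, override)
print(result)  # {'authorization': 'Bearer <token>', 'x-region': 'eu-west'}
```

The input dictionary is left untouched, which mirrors the per-target nature of overrides: other targets would start from the same original headers.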
Header keys are case-insensitive: X-Custom-Auth and x-custom-auth refer to the same header. The gateway normalises all keys to lowercase before applying overrides. Header overrides are applied last, after parameter overrides, so they reflect the final outgoing headers sent to the provider.
Model-specific prompt overrides
When a virtual model sends traffic to targets from different model families, you may need different prompt versions per provider. Configure prompt_version_fqn in override parameters on each target. When a request is routed to a target, the gateway uses that target’s prompt version for hydration.
When and how to use prompt overrides
- Different models require different prompt formats or structures
- You want to optimize prompts for specific model capabilities
- You need to maintain model-specific prompt versions behind one virtual model name
Note: The prompt_version_fqn override does not work with agents (when using MCP/tools). It is supported for standard chat completion requests.
Unhealthy target detection
All the routing strategies described above prefer healthy targets when deciding where to send a request. The gateway continuously monitors every target and automatically marks a target as unhealthy when it starts failing or breaching performance thresholds. This section explains how that health tracking works.
When a target is marked unhealthy, healthy targets are always tried first. Unhealthy targets are moved to the end of the list and used only as a last resort if all healthy targets fail. Recovery is automatic once errors age out of the evaluation window.
Failure-based cooldown (all routing types)
- Error responses considered: 5xx, 429, 401, and 403
- Default failure threshold: 2 or more failures
- Default evaluation window: last 2 minutes (rolling window)
- Recovery: automatic, once failures age out of the window
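The defaults above can be modelled with a per-target queue of failure timestamps. The sketch is a simulation only; `HealthTracker` and its methods are hypothetical names, and real gateways would track this state across instances rather than in one process.

```python
from collections import defaultdict, deque

FAILURE_THRESHOLD = 2   # 2 or more failures marks the target unhealthy
WINDOW_SECONDS = 120    # rolling 2-minute evaluation window

def counts_as_failure(status):
    return 500 <= status <= 599 or status in (429, 401, 403)

class HealthTracker:
    def __init__(self):
        self.failures = defaultdict(deque)  # target -> failure timestamps

    def record(self, target, status, now):
        if counts_as_failure(status):
            self.failures[target].append(now)

    def is_healthy(self, target, now):
        q = self.failures[target]
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()  # failures outside the window no longer count
        return len(q) < FAILURE_THRESHOLD

    def order(self, targets, now):
        """Healthy targets first, unhealthy ones kept as a last resort."""
        healthy = [t for t in targets if self.is_healthy(t, now)]
        unhealthy = [t for t in targets if t not in healthy]
        return healthy + unhealthy

tracker = HealthTracker()
tracker.record("azure/gpt-4o", 503, now=0)
tracker.record("azure/gpt-4o", 429, now=30)

targets = ["azure/gpt-4o", "openai/gpt-4o"]
print(tracker.order(targets, now=60))   # ['openai/gpt-4o', 'azure/gpt-4o']
print(tracker.order(targets, now=140))  # ['azure/gpt-4o', 'openai/gpt-4o']
```

At 140 seconds the first failure has aged out, leaving a single failure in the window, which is below the threshold, so the target recovers without any manual action.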
SLA-based cooldown (priority-based routing only)
This cooldown applies to targets that set sla_cutoff.time_per_output_token_ms. If the average TPOT over a 3-minute rolling window exceeds the configured threshold (with at least 3 samples), the target is marked unhealthy. See SLA cutoff above.
FAQ
Can I change the routing strategy after creating a virtual model?
How do I know which target handled a request?
Check the x-tfy-resolved-model response header. Its value may differ from the virtual model name you requested because of load balancing or fallbacks. You can also view per-target traffic, success rates, and latency in the AI Gateway dashboard.
Can one virtual model support several API types (chat and embedding)?
What if every target fails?
Can a virtual model point to another virtual model as a target?
Can I use sticky routing with latency-based or priority-based routing?
Will the same session always go to the same model forever?
No. A session stays pinned only within the ttl_seconds window. Once the window expires, the session is re-evaluated and may land on a different model. Size ttl_seconds to match your typical session duration.
Will sticky routing work if my gateway has multiple pods?
Do header overrides affect all targets or just the one I configure them on?