A virtual model is a named entry in TrueFoundry AI Gateway that your application calls like any other model (for example my-group/production-chat). Behind that name, you configure one routing strategy and one or more real target models (for example azure/gpt-4o and openai/gpt-4o). The gateway handles load balancing, health-aware routing, retries, and fallbacks automatically so you do not hard-code provider details in every service.

Why route across multiple targets?

Production LLM traffic benefits when the gateway can choose or fail over among more than one backend. Common drivers:
  • Provider outages: Model providers experience outages and downtime. For example, OpenAI's and Anthropic's status pages from February to May 2025 show multiple incidents, service disruptions, and degraded-performance events. To avoid application downtime when a model goes down, many organizations configure multiple providers with load balancing so traffic routes to a healthy model whenever one fails.
  • Latency variance: Latency varies by time, region, model, and provider; latency measurements for a few models over the course of a month show significant fluctuations between providers. Dynamic routing lets the gateway send each request to the model with the lowest latency at that point in time.
  • Rate limits: Many LLM providers enforce strict rate limits on API usage, such as Azure OpenAI's per-model TPM (tokens per minute) and RPM (requests per minute) quotas. When these limits are exceeded, requests begin to fail; routing to other models keeps your application running.
  • Safe rollouts: Testing new models or updates in production carries significant risk. Weighted routing can send a small percentage of traffic to a new model so you can monitor its performance before shifting all traffic to it.

Why virtual models?

  • Stable API surface — Your apps pass one model identifier; you change targets, weights, or providers in the gateway without redeploying clients.
  • Resilience — Retries and fallback status codes route around rate limits and transient errors across targets.
  • Governance — Virtual model provider groups support collaborator roles so teams can use or manage routing separately.

Routing strategies

When you create a virtual model, you choose one of three routing strategies. Each strategy uses the same list of targets; the difference is how the gateway picks among healthy targets for each request.
| Strategy | How it works | Best for |
| --- | --- | --- |
| Weight-based | Distributes traffic by assigned weights (e.g. 80/20). Also supports sticky routing to pin sessions to a target. | Canary rollouts, fixed capacity splits, A/B allocation |
| Priority-based | Routes to the highest-priority healthy target (0 = highest). Falls back to the next on failure. Supports SLA cutoff. | Primary + backup topologies, cost optimization |
| Latency-based | Automatically routes to the target with the lowest recent latency. No weights needed. | Performance chasing across regions or providers |

Weight-based routing

You assign a weight to each target. The gateway distributes incoming requests in proportion to those weights. For example, 90% to azure/gpt-4o and 10% to openai/gpt-4o.
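A minimal weight-based configuration, sketched from the fields used in the sticky-routing example later on this page (provider and model names are placeholders; verify the exact schema against the gateway's reference):

```yaml
routing_config:
  type: weight-based-routing
  load_balance_targets:
    - target: azure/gpt-4o      # receives ~90% of traffic
      weight: 90
      fallback_candidate: true
    - target: openai/gpt-4o     # receives ~10% of traffic
      weight: 10
      fallback_candidate: true
```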

Sticky routing (weight-based only)

Sticky routing pins requests that share the same session key to the same target model for a configurable time window (ttl_seconds). Within that window, every request carrying the same session identifier is routed to the same model. When the window expires, the session is re-evaluated and may land on a different model. Useful for multi-turn conversations, prompt cache efficiency, and consistent user experience.
Add a sticky_routing block inside your weight-based routing configuration. Two fields are required: ttl_seconds and at least one entry in session_identifiers.
routing_config:
  type: weight-based-routing
  sticky_routing:
    ttl_seconds: 3600
    session_identifiers:
      - key: x-user-id
        source: headers
  load_balance_targets:
    - target: provider-a/model-a
      weight: 70
      fallback_candidate: true
    - target: provider-b/model-b
      weight: 30
      fallback_candidate: true
Session identifiers tell the gateway which fields to read from the request to identify a session. All configured identifiers are combined into a single session key — you can mix headers and metadata fields.
  • key — The header or metadata field name to read.
  • source — headers to read from HTTP request headers, or metadata to read from request metadata.
# Pin by a combination of user and conversation
session_identifiers:
  - key: x-user-id
    source: headers
  - key: x-conversation-id
    source: headers

# Pin using request metadata
session_identifiers:
  - key: tenant-id
    source: metadata
  - key: user-id
    source: metadata
If a configured identifier is missing from the request, it contributes an empty string to the session key. All requests missing that field will be treated as the same session. Make sure your clients always send the identifier fields you configure.
TTL window: ttl_seconds defines how long a session stays pinned. For chatbots, 3600 (1 hour) is a common starting point. For longer workflows, consider 86400 (24 hours).
Fallback during a sticky session: If the pinned model fails, the remaining healthy targets are tried in sequence (this fallback order is not weight-based). Only targets with fallback_candidate: true are eligible. After a successful fallback, subsequent requests for that session in the same TTL window are routed to the working target — not back to the one that failed.

Priority-based routing

Each target has a priority number. The gateway routes to the highest priority target (0 is highest) that is healthy. If that target fails or is unavailable, the gateway falls back to the next.
Priority-based routing supports SLA cutoff to automatically mark models as unhealthy when they breach performance thresholds:
  • Configure a Time Per Output Token (TPOT) threshold per target using sla_cutoff.time_per_output_token_ms
  • The gateway monitors average TPOT over a 3-minute rolling window (up to 10 samples, minimum 3 required)
  • If TPOT exceeds the threshold, the target is marked unhealthy and moved to the end of the list
  • Recovery is automatic when metrics improve or older data ages out
SLA cutoff is only available for priority-based routing, not weight-based or latency-based.
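Putting the pieces above together, a priority-based configuration with an SLA cutoff might look like this sketch. The priority and sla_cutoff field shapes follow the descriptions in this section; treat the exact structure as an assumption and check it against the gateway's schema:

```yaml
routing_config:
  type: priority-based-routing
  load_balance_targets:
    - target: azure/gpt-4o            # primary (0 = highest priority)
      priority: 0
      sla_cutoff:
        time_per_output_token_ms: 200 # mark unhealthy if avg TPOT over the rolling window exceeds 200 ms
    - target: openai/gpt-4o           # backup, used when the primary fails or is unhealthy
      priority: 1
```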

Latency-based routing

You do not set weights. The gateway chooses the target with the lowest recent latency automatically.
  1. The time per output token (inter-token latency) is used as the latency metric.
  2. Only requests in the last 20 minutes are considered. If more than 100 requests exist in that window, the last 100 are used. If fewer than 3 requests exist, the model is treated as the fastest so that more traffic is routed to it to gather data.
  3. Models are considered equally fast if their latency is within 1.2× of the fastest — this avoids rapid switching due to minor differences.
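A latency-based configuration only lists targets; the gateway measures inter-token latency itself and picks the fastest. A minimal sketch (the exact type string is an assumption based on the naming pattern of the other strategies):

```yaml
routing_config:
  type: latency-based-routing
  load_balance_targets:
    - target: azure/gpt-4o    # no weights or priorities; selection is by measured TPOT
    - target: openai/gpt-4o
```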

Per-target configuration

Regardless of which routing strategy you choose, each target in the virtual model supports several options that control what happens when a request is routed to that target.

Retries and fallbacks

Each target can define how the gateway should handle failures before giving up or moving to another target:
  • Retry configuration — Number of attempts, delay between retries, and which status codes trigger a retry on the same target. Defaults: 2 attempts, 100 ms delay, retry on 429, 500, 502, 503.
  • Fallback status codes — Which status codes cause the gateway to stop retrying this target and try a different target instead. Default: 401, 403, 404, 429, 500, 502, 503.
  • Fallback candidate — Whether this target is eligible to receive traffic when another target fails. Default: true.
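As a sketch, the defaults above could be written out explicitly on a target like this. The retry_config and fallback_status_codes field names are illustrative assumptions based on the options described; verify them against the gateway's configuration reference:

```yaml
load_balance_targets:
  - target: provider-a/model-a
    weight: 100
    retry_config:
      attempts: 2                     # retry the same target up to 2 times
      delay_ms: 100                   # wait 100 ms between attempts
      retry_status_codes: [429, 500, 502, 503]
    fallback_status_codes: [401, 403, 404, 429, 500, 502, 503]
    fallback_candidate: true          # eligible to receive traffic when another target fails
```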

Header overrides

You can inject or remove HTTP headers on a per-target basis, applied just before the request is sent to that model. This is useful when a specific target requires headers that the others don’t — for example, a region identifier, a deployment ID, or an API version header expected by one provider but not the rest.
Add a headers_override block to any target in your load_balance_targets list:
load_balance_targets:
  - target: provider-a/model-a
    weight: 80
    headers_override:
      set:
        x-region: us-east-1
      remove:
        - x-internal-debug
  - target: provider-b/model-b
    weight: 20
    headers_override:
      set:
        x-region: eu-west-1
  • set — Key-value pairs of headers to add or overwrite on the outgoing request.
  • remove — List of header keys to strip from the outgoing request.
Header keys are case-insensitive: X-Custom-Auth and x-custom-auth refer to the same header. The gateway normalises all keys to lowercase before applying overrides. Header overrides are applied last, after parameter overrides, so they reflect the final outgoing headers sent to the provider.

Model-specific prompt overrides

When a virtual model sends traffic to targets from different model families, you may need different prompt versions per provider. Configure prompt_version_fqn in override parameters on each target. When a request is routed to a target, the gateway uses that target’s prompt version for hydration.
This is useful when:
  • Different models require different prompt formats or structures
  • You want to optimize prompts for specific model capabilities
  • You need to maintain model-specific prompt versions behind one virtual model name
The prompt_version_fqn override does not work with agents (when using MCP/tools); it is supported only for standard chat completion requests.
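As a sketch, a per-target prompt version might be attached through each target's override parameters. The override_params field name and the FQN format shown here are illustrative assumptions, not the confirmed schema:

```yaml
load_balance_targets:
  - target: provider-a/model-a
    weight: 50
    override_params:
      prompt_version_fqn: prompt:my-team/chat-prompt-a:3   # hypothetical FQN format
  - target: provider-b/model-b
    weight: 50
    override_params:
      prompt_version_fqn: prompt:my-team/chat-prompt-b:1
```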

Unhealthy target detection

All the routing strategies described above only consider healthy targets when deciding where to send a request. The gateway continuously monitors every target and automatically marks targets as unhealthy when they start failing or breaching performance thresholds. This section explains how that health tracking works. When a target is marked unhealthy, healthy targets are always tried first. Unhealthy targets are moved to the end of the list and only used as a last resort if all healthy targets fail. Recovery is automatic once errors age out of the evaluation window.
The gateway tracks error responses for each target and marks a target unhealthy when failures cross a threshold in a recent time window.
  • Error responses considered: 5xx, 429, 401, and 403
  • Default failure threshold: 2 or more failures
  • Default evaluation window: last 2 minutes (rolling window)
  • Recovery: automatic, once failures age out of the window
For priority-based routing, you can also configure a latency threshold per target using sla_cutoff.time_per_output_token_ms. If the average TPOT over a 3-minute rolling window exceeds the configured threshold (with at least 3 samples), the target is marked unhealthy. See SLA cutoff above.

FAQ

Can I change a virtual model's routing configuration while traffic is flowing?
Yes. Updates apply to new requests immediately; in-flight requests keep their current routing.
How do I know which target model actually served a request?
The gateway returns the actual model used in the x-tfy-resolved-model response header. This may differ from the virtual model you requested due to load balancing or fallbacks. You can also view per-target traffic, success rates, and latency in the AI Gateway dashboard.
Can a single virtual model serve multiple model types?
Yes, if you enable multiple model types on the virtual model. Every target must support the operation you call.
What happens when all targets fail?
After retries and fallbacks are exhausted, the request fails with an error. Add enough fallback candidates for critical paths.
Can a virtual model use another virtual model as a target?
No. Allowing virtual models as targets could lead to recursive or deeply nested routing chains that are hard to reason about and debug. To keep routing predictable, targets must be real catalog models. If you need to split traffic further, create separate virtual models and have your client choose between them.
Is sticky routing available for priority-based or latency-based routing?
No — but it's typically not needed. With priority-based routing, requests already go to the same highest-priority healthy target consistently. With latency-based routing, the gateway sticks with the fastest target and only switches when another target becomes measurably faster (within a 1.2× threshold), so oscillation between models is rare. Sticky routing exists for weight-based routing because that strategy deliberately spreads traffic across targets, and you may want to override that for session consistency.
Does a sticky session stay pinned to the same model indefinitely?
No — only within the configured ttl_seconds window. Once the window expires, the session is re-evaluated and may land on a different model. Size ttl_seconds to match your typical session duration.
Does sticky routing work consistently across multiple gateway pods?
Yes. Every gateway pod uses the same configuration and the same time-based window, so all pods independently arrive at the same assignment for the same session. When a fallback occurs mid-window, the update is propagated across all pods automatically.
Do header overrides on one target affect other targets?
No. Header overrides are strictly per-target — they only apply when a request is dispatched to that specific target. Other targets in the same virtual model are not affected.

Next steps

Ready to set up a virtual model? See Create a virtual model for the step-by-step walkthrough and common configuration patterns.