Many deployed services and models are not in use all the time. A very common pattern is dev services sitting idle on weekends or outside office hours. Truefoundry provides a pause/resume feature, but it relies on developers pausing the service manually, which can lead to human error and cost leakage. The scale to 0 feature is an alternative that shuts a service down automatically when it receives no requests for a set period and starts it again when a request arrives. For example, you can configure the service to shut down after 10 minutes without requests. Once those 10 minutes pass, the service is scaled to 0. The next request to the service automatically scales it back up.
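The idle-timeout behavior described above can be sketched as a small state model. This is only an illustration of the rule, not Truefoundry's implementation: `ServiceState`, `on_tick`, `on_request`, and `IDLE_TIMEOUT_SECONDS` are hypothetical names, and the 10-minute window is the example value from the text.

```python
from dataclasses import dataclass

# Assumption: the 10-minute example window from the text, in seconds.
IDLE_TIMEOUT_SECONDS = 10 * 60


@dataclass
class ServiceState:
    replicas: int
    last_request_at: float  # timestamp in seconds, from any consistent clock


def on_tick(state: ServiceState, now: float) -> ServiceState:
    """Scale a running service to 0 once the idle window has fully elapsed."""
    idle_for = now - state.last_request_at
    if state.replicas > 0 and idle_for >= IDLE_TIMEOUT_SECONDS:
        return ServiceState(replicas=0, last_request_at=state.last_request_at)
    return state


def on_request(state: ServiceState, now: float) -> ServiceState:
    """A request resets the idle timer and wakes a scaled-down service."""
    return ServiceState(replicas=max(state.replicas, 1), last_request_at=now)
```

In this model a service with its last request at `t = 0` stays up at `t = 599`, is scaled to 0 at `t = 600`, and returns to 1 replica on the next call to `on_request`.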
Configuring Scale to 0
In the Service deployment form, enable the Advanced Fields toggle to reveal the Auto Shutdown section.

How does Scale to 0 work?
Scale to 0 is powered by Elasti, an open-source, Kubernetes-native solution built by Truefoundry. It scales a service down to zero when there is no traffic and automatically scales it back up from 0 when traffic arrives. A brief summary of Elasti:

Most Kubernetes autoscaling solutions, such as HPA or KEDA, can scale from 1 to n replicas based on CPU utilization or memory usage. However, these solutions do not offer a way to scale to 0 when there is no traffic. Elasti solves this problem by dynamically managing service replicas based on real-time traffic conditions. It only handles scaling the application down to 0 replicas and scaling it back up to 1 replica when traffic is detected again. Scaling beyond 1 replica is handled by an autoscaler such as HPA or KEDA.

Elasti uses a proxy mechanism that queues and holds requests for scaled-down services, bringing them up only when needed. The proxy is used only while the service is scaled down to 0. Once the service is scaled up to 1, the proxy is disabled and requests are processed directly by the service's pods.
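The proxy behavior described above can be illustrated with a toy, in-memory model. It is a minimal sketch under stated assumptions and not Elasti's actual code or API: the class `ScaleToZeroProxy`, its methods, and the `handler` callback (standing in for the service's pods) are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List


class ScaleToZeroProxy:
    """Toy model: hold requests while replicas == 0, bypass the queue otherwise."""

    def __init__(self, handler: Callable[[str], str]) -> None:
        self.handler = handler              # hypothetical stand-in for the pods
        self.replicas = 0                   # start scaled down
        self.pending: Deque[str] = deque()  # requests held by the proxy
        self.log: List[str] = []

    def handle(self, request: str) -> None:
        if self.replicas == 0:
            # Proxy mode: hold the request until the first replica is ready.
            self.pending.append(request)
            self.log.append(f"queued {request}")
        else:
            # Direct mode: the proxy is out of the request path.
            self.log.append(f"direct {self.handler(request)}")

    def on_pod_ready(self) -> None:
        """First replica is ready: mark 0 -> 1 and flush held requests in order."""
        self.replicas = 1
        while self.pending:
            self.log.append(f"flushed {self.handler(self.pending.popleft())}")

    def scale_to_zero(self) -> None:
        """No traffic for the configured window: route through the proxy again."""
        self.replicas = 0
```

A short usage example: requests that arrive at 0 replicas are held, then served once a pod is ready, after which new requests go straight to the service.

```python
proxy = ScaleToZeroProxy(handler=lambda r: r.upper())
proxy.handle("a")      # held by the proxy
proxy.on_pod_ready()   # held request is flushed to the pod
proxy.handle("b")      # served directly
print(proxy.log)
```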
