
Access Policies Overview
| Policy | Description |
|---|---|
| Access required for Azure Container Registry, storage account | An Azure Container Registry is used to store the Docker images for the platform. A storage account is used to store the model artifacts. |
| Azure AD application with Reader and Monitoring Reader on AKS | The Reader and Monitoring Reader permissions on AKS are used to access the cluster autoscaler logs in Log Analytics and to read Azure node pools. The user should have access to create an Azure AD application. |
Requirements:
The common requirements to set up the compute plane in each of the scenarios are as follows:
- Billing must be enabled for the Azure subscription.
- Ensure that the Microsoft.Storage resource provider is registered. Check this link for more details.
- Egress access to the container registries `public.ecr.aws`, `quay.io`, `ghcr.io`, `tfy.jfrog.io`, `docker.io/natsio`, `nvcr.io`, and `registry.k8s.io`, so that we can download the Docker images for ArgoCD, NATS, GPU operator, Argo Rollouts, Argo Workflows, Istio, KEDA, etc.
- We need a domain to map to the service endpoints and a certificate to encrypt the traffic. A wildcard domain like `*.services.example.com` is preferred. TrueFoundry can do path-based routing like `services.example.com/tfy/*`; however, many frontend applications do not support this.
- Enough quotas for CPU/GPU instances must be present depending on your use case. You can check and increase quotas at Azure compute quotas.
- Ensure that host encryption is enabled.
The compute plane can be set up in one of three scenarios:
- New network and New AKS Cluster
- Existing network and New AKS Cluster
- Existing AKS Cluster

For a new network and a new AKS cluster, you additionally need:
- The new VPC subnet should have a CIDR range of /24 or larger (that is, a prefix length of 24 or shorter). Secondary ranges for pods (min /20) and services (min /24) are required. The secondary ranges can come from a non-routable range. This ensures capacity for ~250 instances and 4096 pods.
- A user or service account to provision the infrastructure.
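The subnet sizes above translate directly into address counts. The following is a minimal sketch of that arithmetic using Python's standard `ipaddress` module; the example ranges (`10.10.0.0/24`, `100.64.0.0/20`) are hypothetical placeholders, and the helper names are illustrative rather than part of any TrueFoundry tooling. It relies on the fact that Azure reserves five addresses in every subnet, which is why a /24 gives roughly 250 usable node addresses.

```python
import ipaddress

# Azure reserves 5 IP addresses in every subnet.
AZURE_RESERVED_IPS = 5


def total_addresses(cidr: str) -> int:
    """Return the total number of addresses in a CIDR range."""
    return ipaddress.ip_network(cidr, strict=True).num_addresses


def usable_node_addresses(cidr: str) -> int:
    """Return the addresses left for nodes after Azure's reserved ones."""
    return total_addresses(cidr) - AZURE_RESERVED_IPS


def meets_minimum(cidr: str, max_prefix_len: int) -> bool:
    """True if the range is at least as large as a /max_prefix_len range."""
    return ipaddress.ip_network(cidr, strict=True).prefixlen <= max_prefix_len


if __name__ == "__main__":
    # Hypothetical example ranges; substitute your own.
    print(usable_node_addresses("10.10.0.0/24"))  # node subnet
    print(total_addresses("100.64.0.0/20"))       # pod range
    print(meets_minimum("10.10.0.0/25", 24))      # a /25 is too small
```

A check like this can be run against the ranges you plan to enter in the cluster form before generating the OpenTofu/Terraform code.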
Setting up compute plane
TrueFoundry compute plane infrastructure is provisioned using OpenTofu/Terraform. You can download the OpenTofu/Terraform code for your exact account by filling in your account details and downloading a script that can be executed on your local machine.
Enable Deployment Feature in the Platform (Optional)
To use the deployment feature, which allows you to deploy services through the platform, enable it as follows:

- In the left-hand navigation, go to Settings, then Platform Feature Visibility under Preferences.
- Click the Edit button, then turn on the Enable Deployment toggle.
- Click the Save button.

Choose to create a new cluster or attach an existing cluster
Go to the platform section in the left panel and click on Clusters. You can click on Create New Cluster or Attach Existing Cluster depending on your use case. Read the requirements and, if everything is satisfied, click on Continue.
Fill up the form to generate the OpenTofu/Terraform code
A form will be presented with the details for the new cluster to be created. Fill it in with your cluster details and click Submit when done. The form has two variants:
- Create New Cluster
- Attach Existing Cluster
The key fields to fill in here are:
- Region - The region and availability zones where you want to create the cluster.
- Resource Group - The resource group where you want to create the cluster. Choose between New Resource Group or Existing Resource Group depending on your use case.
- Cluster Name - A name for your cluster.
- Kubernetes Version - The Kubernetes version for the cluster (e.g. 1.34).
- Node Pools - Configure CPU and GPU node pools for the cluster. The form comes with sensible defaults which you can adjust based on your workload requirements. The default node pool configuration is:

  | Pool | Type | Instance Type | Capacity | Min | Max |
  |---|---|---|---|---|---|
  | initial (system) | On-Demand | Standard_D4ds_v5 | CPU | 2 | 2 |
  | cpu | On-Demand | Standard_D4ds_v5 | CPU | 0 | 2 |
  | cpu2x | On-Demand | Standard_D8ds_v5 | CPU | 0 | 2 |
  | a10 | On-Demand | Standard_NV6ads_A10_v5 | GPU | 0 | 2 |
  | t4 | On-Demand | Standard_NC4as_T4_v3 | GPU | 0 | 2 |

  The initial pool is the system node pool that runs TrueFoundry platform components (ArgoCD, Istio, tfy-agent, etc.) and must always be on-demand with at least 2 nodes. You can add, remove, or resize the other CPU/GPU pools to match your workload needs. GPU pools can be removed entirely if you don't plan to run GPU workloads. Make sure you have sufficient Azure compute quotas for the instance types you select.
- Network Configuration - Choose between New Vnet or Existing Vnet depending on your use case.
- DNS Configuration - Configure the DNS zone and domains that will point to the cluster's load balancer. This also provisions a TLS certificate for those domains. Select New DNS Zone or Existing DNS Zone if you want TrueFoundry to manage DNS in Azure. If you use an external DNS provider (e.g., Route53, Cloudflare), you can skip this section.
- Storage account (container) for OpenTofu/Terraform State - OpenTofu/Terraform state will be stored in this container. It can be a preexisting storage account or a new storage account name. A new storage account will be created automatically by our script.
- Platform Features - This decides which features, such as BlobStorage, ClusterIntegration using Azure AD, and Container Registry, will be enabled for your cluster. To read more on how these integrations are used in the platform, please refer to the platform features page.
Copy the curl command and execute it on your local machine
You will be presented with a curl command to download and execute the script. The script will take care of installing the prerequisites, downloading the OpenTofu/Terraform code, and running it on your local machine to create the cluster. This will take around 40-50 minutes to complete.
Verify the cluster is showing as connected in the platform
Once the script is executed, the cluster will be shown as connected in the platform.
Create DNS Record
You can find the load balancer's IP address by going to the platform section in the bottom left panel, under the Clusters section. Under the preferred cluster, you'll see the load balancer IP address under the Base Domain URL section.
Create a DNS record in your Azure DNS Zone or your DNS provider with the following details:
| Record Type | Record Name | Record value |
|---|---|---|
| A | *.tfy.example.com | LOADBALANCER_IP_ADDRESS |
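Once the record has propagated, any hostname under the wildcard should resolve to the load balancer IP. A minimal sketch of such a check in Python follows; the function names and the probe label `dns-check` are hypothetical, `203.0.113.10` is a documentation-range placeholder for LOADBALANCER_IP_ADDRESS, and the resolver is injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable


def probe_hostname(record_name: str, label: str = "dns-check") -> str:
    """Turn a wildcard record name like '*.tfy.example.com' into a concrete host."""
    if not record_name.startswith("*."):
        raise ValueError("expected a wildcard record name starting with '*.'")
    return f"{label}.{record_name[2:]}"


def wildcard_points_to(
    record_name: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if a host under the wildcard resolves to expected_ip."""
    try:
        return resolve(probe_hostname(record_name)) == expected_ip
    except OSError:
        # socket.gaierror (name not found) is a subclass of OSError.
        return False


if __name__ == "__main__":
    # Placeholder values; use your own domain and load balancer IP.
    print(wildcard_points_to("*.tfy.example.com", "203.0.113.10"))
```

DNS changes can take some time to propagate, so a negative result immediately after creating the record is not necessarily an error.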
Start deploying workloads to your cluster
You can start by going here
Permissions required to create the infrastructure
The IAM user should have the following permissions:
- Contributor role on the above subscription
- Role Based Access Control Administrator on the above subscription
- Either the Azure AD Administrator or the Azure AD Application Developer role to:
  - Create app registrations and service principals
  - Assign the Reader role to the AD application for read-only AKS cluster access
  - Assign the Monitoring Reader role to applications for cluster monitoring (Ref: How to add Azure admin permission)
FAQ
Can I use my own certificate and key files to add TLS to the load balancer?
If you have your own certificate files (for example, from another certificate provider or self-signed), you can use them directly with TrueFoundry.
- Create a Kubernetes secret with your certificate and key, or create a self-signed certificate.
- Once the secret is created, head over to the cluster page and navigate to the `tfy-istio-ingress` add-on. Add the secret name in the `tfyGateway.spec.servers[1].tls.credentialName` section and ensure that `tfyGateway.spec.servers[1].port.protocol` is set to `HTTPS`. Here we are using `example-com-tls` as the secret name, which contains the certificate and key.
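For the first step, one way to produce the secret is to generate a standard `kubernetes.io/tls` Secret manifest from your PEM files and apply it with `kubectl apply -f`. The sketch below is not the command from the original page; it is an illustration under assumptions. The helper name is hypothetical, and the `istio-system` namespace used in the usage note is an assumption (Istio reads `credentialName` secrets from the namespace of the ingress gateway workload, so use whichever namespace your gateway runs in).

```python
import base64
import json


def tls_secret_manifest(
    name: str, namespace: str, cert_pem: bytes, key_pem: bytes
) -> dict:
    """Build a Kubernetes TLS Secret manifest from PEM-encoded cert and key."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "kubernetes.io/tls",
        "metadata": {"name": name, "namespace": namespace},
        "data": {
            # Secret data values must be base64-encoded strings.
            "tls.crt": base64.b64encode(cert_pem).decode("ascii"),
            "tls.key": base64.b64encode(key_pem).decode("ascii"),
        },
    }


def manifest_json(manifest: dict) -> str:
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

As a usage example, read your certificate and key files as bytes, call `tls_secret_manifest("example-com-tls", "istio-system", cert, key)`, write the output of `manifest_json` to a file, and apply that file with `kubectl apply -f`.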
How do I add node pools after cluster creation?
If you need to add or modify node pools after the cluster is created, you can do so using the Azure CLI. Set the required variables before running the commands. Node pools can be added for the following types:
- On-Demand CPU
- Spot CPU
- On-Demand GPU
- Spot GPU

You can browse available instance types and pricing at azureprice.net. New node pools are automatically synced in TrueFoundry if the Azure AD application has Reader access on the AKS cluster.
When should I use spot vs on-demand node pools?
| | On-Demand | Spot |
|---|---|---|
| Availability | Guaranteed — no interruptions | Can be reclaimed by Azure at any time |
| Cost | Standard pricing | Up to 60-90% cheaper |
| Best for | Production services, databases, platform components | Dev/test workloads, batch jobs, interruptible tasks |
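To make the on-demand and spot options concrete, here is a minimal sketch that assembles an `az aks nodepool add` invocation for either kind of pool. It is an illustration, not the commands from the original page: the function name, the example resource group, cluster, and pool names are hypothetical, and starting the pool at `min_count` nodes is an assumption you may want to change. The flags used (`--enable-cluster-autoscaler`, `--min-count`, `--max-count`, `--priority`, `--eviction-policy`, `--spot-max-price`) are standard Azure CLI options.

```python
import re
import shlex
from typing import List

# AKS Linux node pool names: lowercase alphanumeric, starting with a letter,
# at most 12 characters.
_POOL_NAME = re.compile(r"[a-z][a-z0-9]{0,11}")


def nodepool_add_command(
    resource_group: str,
    cluster_name: str,
    pool_name: str,
    vm_size: str,
    min_count: int,
    max_count: int,
    spot: bool = False,
) -> List[str]:
    """Build the argv for adding an autoscaled node pool to an AKS cluster."""
    if not _POOL_NAME.fullmatch(pool_name):
        raise ValueError(f"invalid node pool name: {pool_name!r}")
    if not 0 <= min_count <= max_count:
        raise ValueError("require 0 <= min_count <= max_count")
    cmd = [
        "az", "aks", "nodepool", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster_name,
        "--name", pool_name,
        "--node-vm-size", vm_size,
        # Assumption: start at the autoscaler minimum.
        "--node-count", str(min_count),
        "--enable-cluster-autoscaler",
        "--min-count", str(min_count),
        "--max-count", str(max_count),
    ]
    if spot:
        # A max price of -1 means "pay up to the current on-demand price".
        cmd += ["--priority", "Spot", "--eviction-policy", "Delete",
                "--spot-max-price", "-1"]
    return cmd


if __name__ == "__main__":
    # Hypothetical names; print the command rather than running it.
    argv = nodepool_add_command(
        "my-rg", "my-cluster", "spotcpu", "Standard_D4ds_v5", 0, 2, spot=True
    )
    print(shlex.join(argv))
```

Printing the command first makes it easy to review before running it; the same list can be passed to `subprocess.run` once the values are confirmed.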