The architecture of a TrueFoundry compute plane is as follows:

  • ELBControllerPolicy: Role assumed by the load balancer controller to provision an ELB when a Service of type LoadBalancer is created.
  • KarpenterPolicy and SQSPolicy: Role assumed by Karpenter to dynamically provision nodes and handle spot node termination.
  • EFSPolicy: Role assumed by the EFS CSI driver to provision and attach EFS volumes.
  • EBSPolicy: Role assumed by the EBS CSI driver to provision and attach EBS volumes.
  • Role with policies for ECR, S3, SSM, and EKS (use the trust relationship): Role assumed by TrueFoundry to allow access to ECR, S3, and SSM services. If you are using TrueFoundry’s control plane, the role will be assumed by arn:aws:iam::416964291864:role/tfy-ctl-euwe1-production-truefoundry-deps; otherwise it will be assumed by your control plane’s IAM role. A sample trust policy is sketched below, after the EncryptionPolicy document.
  • ClusterRole with policies AmazonEKSClusterPolicy, AmazonEKSVPCResourceControllerPolicy, and EncryptionPolicy: Role that provides Kubernetes permissions to manage the cluster lifecycle, networking, and encryption.
  • NodeRole with policies AmazonEC2ContainerRegistryReadOnlyPolicy, AmazonEKS_CNI_Policy, AmazonEKSWorkerNodePolicy, and AmazonSSMManagedInstanceCorePolicy: Role assumed by EKS nodes to work with AWS resources for ECR access, IP assignment, and cluster registration.
  • EncryptionPolicy: Policy to create and manage the key used for encryption:
{  
    "Statement": [  
        {  
            "Action": [  
                "kms:Encrypt",  
                "kms:Decrypt",  
                "kms:ListGrants",  
                "kms:DescribeKey"  
            ],  
            "Effect": "Allow",  
            "Resource": "arn:aws:kms:<region>:<aws_account_id>:key/<key_id>"  
        }  
    ],  
    "Version": "2012-10-17"  
}
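For reference, a minimal sketch of the trust relationship on the role assumed by TrueFoundry might look like the following. This assumes the TrueFoundry-hosted control plane role ARN from the list above; the role name truefoundry-platform-role and the file name trust-policy.json are illustrative placeholders, and the Terraform setup described later normally creates this role for you.
# Sketch only: create a role that the TrueFoundry control plane can assume.
# Replace the principal with your own control plane's IAM role ARN if you self-host the control plane.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::416964291864:role/tfy-ctl-euwe1-production-truefoundry-deps"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
  --role-name truefoundry-platform-role \
  --assume-role-policy-document file://trust-policy.json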
Setting up TrueFoundry control plane on your own cloud involves creating the infrastructure to support the platform and then installing the platform itself.

Setting up Infrastructure

Requirements:

The requirements to set up the control plane in each of the scenarios are as follows:
  • Billing and STS must be enabled for the AWS account.
  • Please make sure you have enough quota for GPU/Inferentia instances on the account, depending on your use case. You can check and increase quotas at AWS EC2 service quotas (a sample pre-flight check is sketched after the permissions policy below).
  • Please make sure you have created a certificate for your domain in AWS Certificate Manager (ACM) and have the ARN of the certificate ready. This is required to set up TLS for the load balancer.
  • Postgres database with the following requirements:
    • Version: >= 13
    • Instance Types: db.t3.medium or db.t4g.medium
    • Storage: 20GB of type gp3 with autoscale enabled to 30GB
    • Encryption: Enabled
    • For PostgreSQL 17+: Set the rds.force_ssl parameter to 0 (off) in the DB parameter group if you need to allow non-SSL connections (the default is 1)
    • Security Group: Ensure RDS security group allows inbound traffic from EKS node security groups
  • S3 bucket to store the intermediate code while building the Docker image.
  • Egress Access for TrueFoundry Auth: Egress access to https://auth.truefoundry.com and https://analytics.truefoundry.com is needed to verify users logging in to the TrueFoundry platform for licensing purposes.
  • DNS: Domain for control plane and service endpoints. One endpoint to point to the control plane service (e.g., platform.example.com) and the other to point to the compute plane service (e.g., tfy.example.com/service1). The control-plane URL must be reachable from the compute-plane. The developers will need to access the TrueFoundry UI at the provided domain.
  • We will need a certificate ARN (for the domain provided above) to attach to the load balancer to terminate TLS traffic at the load balancer. This will allow the services we deploy on the cluster to be accessed via HTTPS. We recommend using AWS Certificate Manager to add TLS to the load balancer. You can read the instructions in Step 2 below on how to create the certificate in AWS Certificate Manager.
  • You need to have enough permissions on the AWS account to create the resources needed for the compute plane. Check this for more details. We usually recommend admin permission on the AWS account, but if you need the exact set of fine-grained permissions, you can check the list of permissions below:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "iam:CreateInstanceProfile",
                "iam:DeleteInstanceProfile",
                "rds:AddTagsToResource",
                "iam:GetInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile",
                "rds:DeleteTenantDatabase",
                "iam:AddRoleToInstanceProfile",
                "rds:CreateDBInstance",
                "rds:DescribeDBInstances",
                "rds:RemoveTagsFromResource",
                "rds:CreateTenantDatabase",
                "iam:TagInstanceProfile",
                "rds:DeleteDBInstance"
            ],
            "Resource": [
                "arn:aws:iam::$ACCOUNT_ID:instance-profile/*",
                "arn:aws:rds:$REGION:$ACCOUNT_ID:db:$CLUSTER_NAME*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "rds:AddTagsToResource",
                "rds:DeleteDBSubnetGroup",
                "rds:DescribeDBSubnetGroups",
                "iam:DeleteOpenIDConnectProvider",
                "iam:GetOpenIDConnectProvider",
                "rds:CreateDBSubnetGroup",
                "rds:ListTagsForResource",
                "rds:RemoveTagsFromResource",
                "iam:TagOpenIDConnectProvider",
                "iam:CreateOpenIDConnectProvider",
                "rds:CreateDBInstance",
                "rds:DeleteDBInstance"
            ],
            "Resource": [
                "arn:aws:rds:$REGION:$ACCOUNT_ID:subgrp:$CLUSTER_NAME*",
                "arn:aws:iam::$ACCOUNT_ID:oidc-provider/*"
            ]
        },
        {
            "Sid": "VisualEditor9",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBInstances"
            ],
            "Resource": [
                "arn:aws:rds:$REGION:$ACCOUNT_ID:db:*"
            ]
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy",
                "iam:GetPolicyVersion",
                "iam:GetPolicy",
                "iam:ListPolicyVersions",
                "iam:DeletePolicy",
                "iam:TagPolicy"
            ],
            "Resource": [
                "arn:aws:iam::$ACCOUNT_ID:policy/tfy-*",
                "arn:aws:iam::$ACCOUNT_ID:policy/truefoundry-*",
                "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_Karpenter_Controller_Policy*",
                "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_CNI_Policy*",
                "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_AWS_Load_Balancer_Controller*",
                "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess"
            ]
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "iam:ListPolicies",
                "elasticfilesystem:*",
                "iam:GetRole",
                "s3:ListAllMyBuckets",
                "kms:*",
                "ec2:*",
                "s3:ListBucket",
                "route53:AssociateVPCWithHostedZone",
                "sts:GetCallerIdentity",
                "eks:*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor4",
            "Effect": "Allow",
            "Action": "dynamodb:*",
            "Resource": "arn:aws:dynamodb:$REGION:$ACCOUNT_ID:table/$CLUSTER_NAME*"
        },
        {
            "Sid": "VisualEditor5",
            "Effect": "Allow",
            "Action": "iam:*",
            "Resource": [
                "arn:aws:iam::$ACCOUNT_ID:role/$CLUSTER_NAME*",
                "arn:aws:iam::$ACCOUNT_ID:role/$CLUSTER_NAME*"
            ]
        },
        {
            "Sid": "VisualEditor6",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::$CLUSTER_NAME*/*",
                "arn:aws:s3:::$CLUSTER_NAME*/*",
                "arn:aws:s3:::$CLUSTER_NAME*",
                "arn:aws:s3:::$CLUSTER_NAME*",
                "arn:aws:s3:::$CLUSTER_NAME*",
                "arn:aws:s3:::$CLUSTER_NAME*/*"
            ]
        },
        {
            "Sid": "VisualEditor7",
            "Effect": "Allow",
            "Action": "events:*",
            "Resource": "arn:aws:events:$REGION:$ACCOUNT_ID:rule/$CLUSTER_NAME*"
        },
        {
            "Sid": "VisualEditor8",
            "Effect": "Allow",
            "Action": "sqs:*",
            "Resource": "arn:aws:sqs:$REGION:$ACCOUNT_ID:$CLUSTER_NAME*"
        }
    ]
}
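Before running the setup, you can sanity-check a few of the requirements above from your terminal. This is a minimal pre-flight sketch using the AWS CLI: $REGION is a placeholder, and the quota code shown covers On-Demand G and VT (GPU) instances, so adjust it for the instance families you actually plan to use.
# Confirm STS works and you are operating in the expected AWS account
aws sts get-caller-identity

# Check the On-Demand G and VT (GPU) instance vCPU quota; request an increase if it is too low
aws service-quotas get-service-quota --service-code ec2 --quota-code L-DB2E81BA --region $REGION

# List ACM certificates to find the certificate ARN for your domain
aws acm list-certificates --region $REGION

# Verify egress to the TrueFoundry auth and analytics endpoints
curl -sI https://auth.truefoundry.com | head -n 1
curl -sI https://analytics.truefoundry.com | head -n 1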
Regarding the VPC and EKS cluster, you can decide between the following scenarios:
  • New VPC and New EKS Cluster
  • Existing VPC and New EKS Cluster
  • Existing EKS Cluster
  1. The new VPC should have a CIDR range of /20 or larger, at least 2 availability zones, and private subnets with a CIDR of /24 or larger. This ensures capacity for roughly 250 instances and 4096 pods.
  2. If you want to use a smaller network range for your EKS cluster, TrueFoundry supports EKS custom networking as well.
  3. A NAT gateway will be provisioned to provide internet access to the private subnets.
  4. We should have egress access to public.ecr.aws, quay.io, ghcr.io, tfy.jfrog.io, docker.io/natsio, nvcr.io, and registry.k8s.io so that we can download the Docker images for Argo CD, NATS, the GPU operator, Argo Rollouts, Argo Workflows, Istio, KEDA, etc. A quick connectivity check is sketched below this list.
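The loop below is a rough reachability check for these registries from your network; the Docker Hub images under docker.io are served from the registry-1.docker.io endpoint. Any HTTP status code (for example 200 or 401) indicates the endpoint is reachable; it does not confirm that image pulls are authorized.
for host in public.ecr.aws quay.io ghcr.io tfy.jfrog.io registry-1.docker.io nvcr.io registry.k8s.io; do
  echo -n "$host: "
  curl -s -o /dev/null -w "%{http_code}\n" "https://$host/v2/"
done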

Setting up control plane

TrueFoundry control plane infrastructure is provisioned using Terraform. You can download the Terraform code for your exact account by filling in your account details and downloading a script that can be executed on your local machine. To perform the steps below, you need to register an account on TrueFoundry and log in to the platform.
1. Choose to create a new cluster or attach an existing cluster

Go to the platform section in the left panel and click on Clusters. Append &controlPlaneSetupEnabled=true to the end of your URL; this will enable the control plane installation for you. Click on Create New Cluster or Attach Existing Cluster depending on your use case. Read the requirements and, if everything is satisfied, click Continue.
2. Get Domain and Certificate ARN

We will need two domains and certificate ARNs to point to the load balancer that we will be creating in the next step. Let’s say you have a domain like *.services.example.com; we will create a DNS record for it later in the Create DNS Record step below. We recommend using AWS Certificate Manager (ACM) to create the certificate since it makes it easier to manage and renew certificates automatically. To generate a certificate ARN, follow the steps below (an equivalent AWS CLI sketch is shown after the list). If you are not using AWS Certificate Manager, you can skip this step and continue to the next step.
  1. Navigate to AWS Certificate Manager in the AWS console
  2. Request a public certificate
  3. Specify your domain (e.g., *.services.example.com)
  4. Choose DNS validation (recommended)
  5. Add the CNAME records provided by ACM to your DNS provider (see the AWS documentation on DNS validation for detailed steps)
  6. Wait for the certificate status to change to “Issued” (this may take 30 minutes or longer)
  7. Copy the certificate ARN for the next step (the format will look like arn:aws:acm:region:account:certificate/certificate-id)
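If you prefer the CLI over the console, the same flow can be sketched with the AWS CLI. The wildcard domain below is the example from above, and $REGION and $CERTIFICATE_ARN are placeholders for your own values; the ARN returned by the first command is what goes into the form in the next step.
# Request a DNS-validated public certificate for the wildcard domain
aws acm request-certificate \
  --domain-name '*.services.example.com' \
  --validation-method DNS \
  --region $REGION

# Look up the CNAME record ACM wants you to create for validation
aws acm describe-certificate \
  --certificate-arn $CERTIFICATE_ARN \
  --region $REGION \
  --query 'Certificate.DomainValidationOptions[0].ResourceRecord'

# After adding the CNAME at your DNS provider, wait for the certificate to be issued
aws acm wait certificate-validated --certificate-arn $CERTIFICATE_ARN --region $REGION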
3. Fill in the form to generate the Terraform code

A form will be presented with the details for the new cluster to be created. Fill it in with your cluster details and click Submit when done. Depending on your choice in Step 1, the form will be either Create New Cluster or Attach Existing Cluster. The key fields to fill in here are:
  • Cluster Name - A name for your cluster.
  • Region - The region where you want to create the cluster.
  • Network Configuration - Choose between New VPC or Existing VPC depending on your use case.
  • Authentication - This is how you are authenticated to AWS on your local machine. It’s used to configure Terraform to authenticate with AWS.
  • S3 Bucket for Terraform State - Terraform state will be stored in this bucket. It can be a preexisting bucket or a new bucket name. The new bucket will automatically be created by our script.
  • Control Plane Configuration - Control plane URL and the database details. You can choose between PostgreSQL on Kubernetes, Managed PostgreSQL (RDS), or Existing PostgreSQL configuration depending on your use case.
  • Load Balancer Configuration - This is to configure the load balancer for your cluster. You can choose between a Public or Private load balancer (it defaults to Public). You can also add certificate ARNs and domain names for the load balancer, but these are optional.
Enter the domain and the certificate ARN from the previous step in the form.
4. Copy the curl command and execute it on your local machine

You will be presented with a curl command to download and execute the script. The script will take care of installing the prerequisites, downloading the Terraform code, and running it on your local machine to create the cluster. This will take around 40-50 minutes to complete.
5. Create DNS Record

Once the script has executed, create the DNS record for the control plane URL. To get the load balancer address (on AWS this is a hostname), check the Kubernetes Service of type LoadBalancer in the istio-system namespace. You can run the following command to get it.
kubectl get svc -n istio-system tfy-istio-ingress -ojsonpath='{.status.loadBalancer.ingress[0].hostname}'
Create a DNS record in Route 53 or your DNS provider with the following details. Once the record is in place, visiting the control plane URL will give you the login screen, where you can log in with the same credentials used to register the tenant.

Record Type | Record Name | Record Value
CNAME | CONTROL_PLANE_DOMAIN | LOADBALANCER_HOSTNAME
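If your DNS is hosted in Route 53, a sketch of creating this record with the AWS CLI is shown below. $HOSTED_ZONE_ID, CONTROL_PLANE_DOMAIN, and LOADBALANCER_HOSTNAME are placeholders for your own hosted zone, control plane domain, and the hostname returned by the kubectl command above.
aws route53 change-resource-record-sets \
  --hosted-zone-id $HOSTED_ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "CONTROL_PLANE_DOMAIN",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{ "Value": "LOADBALANCER_HOSTNAME" }]
      }
    }]
  }'

# Verify the record resolves before trying to log in
dig +short CONTROL_PLANE_DOMAIN CNAME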
6. Attach the compute plane to the control plane

We will need to attach the same cluster as a compute plane so that we can manage it from the platform. For this, go to the platform section in the left panel and click on Clusters. Click on Attach Existing Cluster and fill in the details of the control plane cluster. The key fields to fill in here are:
  • Cluster Name - The name of the cluster.
  • Cluster Addons - Unselect all the addons, as they were already installed while bringing up the control plane.
  • Network Configuration - Networking configuration of the control plane cluster.
  • Authentication - This is how you are authenticated to AWS on your local machine. It’s used to configure Terraform to authenticate with AWS.
  • S3 Bucket for Terraform State - Terraform state will be stored in this bucket. It can be a preexisting bucket or a new bucket name. You can use the same bucket that we used for the control plane and change the bucket key to be used for terraform state file.
  • Platform Features - This is to decide which features like BlobStorage, ClusterIntegration, ParameterStore, DockerRegistry and SecretsManager will be enabled for your cluster. To read more on how these integrations are used in the platform, please refer to the platform features page.
7. Copy the curl command and execute it on your local machine

You will be presented with a curl command to download and execute the script. The script will take care of installing the prerequisites, downloading the Terraform code, and running it on your local machine to attach the cluster. This will take around 40-50 minutes to complete.
8. Verify the cluster is showing as connected in the platform

Once the script is executed, the cluster will be shown as connected in the platform.
9. Start deploying workloads to your cluster

You can start by going here.

FAQ

Can I use cert-manager instead of AWS Certificate Manager to add TLS to the load balancer?
Yes, you can use cert-manager to add TLS to the load balancer and not use AWS Certificate Manager. You can follow the instructions here to install cert-manager and add TLS to the load balancer.

Can I use my own certificate and key files for the load balancer?
Yes, please consult this guide to add your own certificate and key files to the load balancer.