Kaito comes in handy when you need to deploy production-grade inference models within a Kubernetes cluster. It currently focuses on open-source LLMs and ships model presets for the falcon, llama2, mistral, and phi2 model families.
For Kaito as an operator, I see three main use cases from my own experience, though there are many more, and many more to come:
When an inference model is required within the cluster, Kaito abstracts away the complexity of creating, autoscaling, and managing GPU nodes (using Karpenter APIs).
It automatically tunes model parameters to fit the GPU hardware using the provided preset configurations.
Model weights are baked into the container image, so if a pod goes down, a new pod can be brought up quickly using the image already downloaded to the node.
The session had two parts: 20 minutes of theory and a 20-minute demo. All the commands used in the demo are listed below. (They are largely taken from the Kaito documentation and slightly adapted to my environment.)
Prerequisites
Azure subscription
Azure CLI
Azure Developer CLI
kubectl
Helm
Azure vCPU quota (this demo requires 12 vCPUs in the Standard NCSv3 family)
You can check the vCPU quota with:
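For example, using the Azure CLI (the region below is a placeholder; use the region you plan to deploy in):

```bash
# List vCPU usage and quota for the target region; confirm at least 12 vCPUs
# are available in the "Standard NCSv3 Family vCPUs" row.
az vm list-usage --location eastus --output table | grep -i "NCSv3"
```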
Step 1: Azure Managed Identity for Kaito provisioner
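The exact commands depend on your cluster setup; the sketch below assumes an existing AKS cluster with the OIDC issuer and workload identity enabled, and uses placeholder names (RESOURCE_GROUP, CLUSTER_NAME, kaitoprovisioner) and a Contributor role scope that you should adjust to match the Kaito documentation for your version.

```bash
# Placeholders: adjust to your environment.
export RESOURCE_GROUP="kaito-demo-rg"
export CLUSTER_NAME="kaito-demo-aks"

# Create a user-assigned managed identity that the Kaito gpu-provisioner will
# use to create GPU nodes on behalf of the cluster.
az identity create --name kaitoprovisioner --resource-group $RESOURCE_GROUP

# Grant the identity permissions on the AKS node resource group so it can
# provision GPU VMs (the role and scope here are assumptions; verify them).
export IDENTITY_PRINCIPAL_ID=$(az identity show --name kaitoprovisioner \
  --resource-group $RESOURCE_GROUP --query principalId -o tsv)
export NODE_RG=$(az aks show --name $CLUSTER_NAME --resource-group $RESOURCE_GROUP \
  --query nodeResourceGroup -o tsv)
az role assignment create --assignee $IDENTITY_PRINCIPAL_ID \
  --role Contributor \
  --scope /subscriptions/$(az account show --query id -o tsv)/resourcegroups/$NODE_RG

# Federate the identity with the gpu-provisioner service account so the
# controller can authenticate through workload identity (service account
# namespace/name follow the Kaito docs and may differ in your release).
export AKS_OIDC_ISSUER=$(az aks show --name $CLUSTER_NAME --resource-group $RESOURCE_GROUP \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)
az identity federated-credential create --name kaito-federated-credential \
  --identity-name kaitoprovisioner --resource-group $RESOURCE_GROUP \
  --issuer $AKS_OIDC_ISSUER \
  --subject system:serviceaccount:"gpu-provisioner:gpu-provisioner" \
  --audience api://AzureADTokenExchange
```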
Step 2: Installing the Kaito Operator
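A minimal sketch of the install, assuming the Helm charts are taken from a clone of the Kaito repository; the repository URL, chart paths, and namespaces below are assumptions that may differ between Kaito versions, and identity- and cluster-specific values must be supplied per the chart's values.yaml.

```bash
# Clone the Kaito repository to get the Helm charts (chart layout may differ by version).
git clone https://github.com/Azure/kaito.git
cd kaito

# Install the gpu-provisioner controller. In practice you also pass the
# subscription, cluster, and managed identity details from Step 1 as chart
# values; see the chart's values.yaml for the exact keys.
helm install gpu-provisioner ./charts/kaito/gpu-provisioner \
  --namespace gpu-provisioner --create-namespace

# Install the Kaito workspace controller (the operator that reconciles Workspace resources).
helm install workspace ./charts/kaito/workspace \
  --namespace kaito-workspace --create-namespace

# Verify both controllers are running.
kubectl get pods -n gpu-provisioner
kubectl get pods -n kaito-workspace
```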
Step 3: Create a Kaito preset workspace
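A minimal Workspace manifest for the falcon-7b-instruct preset, roughly following the examples in the Kaito documentation; the API version, workspace name, instance type, and label values are assumptions to verify against your Kaito release.

```bash
# Ask Kaito to provision a GPU node (Standard_NC12s_v3 here) and deploy the
# falcon-7b-instruct preset on it.
kubectl apply -f - <<EOF
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b-instruct
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: falcon-7b-instruct
inference:
  preset:
    name: "falcon-7b-instruct"
EOF

# Watch the workspace until it reports ready; GPU node provisioning and the
# model image pull can take a while.
kubectl get workspace workspace-falcon-7b-instruct -w
```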
Step 4: Test the falcon-7b-instruct workspace inference endpoint
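The inference service can be tested from inside the cluster. The service name, the /chat path, and the prompt field below follow the falcon preset examples in the Kaito documentation, but verify them for your version; the question text is just an example.

```bash
# Grab the cluster IP of the service created by the workspace
# (the service is named after the workspace).
export CLUSTERIP=$(kubectl get svc workspace-falcon-7b-instruct \
  -o jsonpath="{.spec.clusterIP}")

# Run a throwaway curl pod inside the cluster and POST a prompt to the
# falcon-7b-instruct inference endpoint.
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- \
  curl -X POST http://$CLUSTERIP/chat \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "What is Kubernetes?"}'
```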
That’s the end of the demo. Thanks for showing up at the Cloud Native Sri Lanka event, and I hope to do more interesting sessions at future meetups.