Deploying OpenAI’s GPT-OSS-20B on Azure AKS with KAITO and vLLM
Deploy and optimize OpenAI’s open-weight GPT-OSS-20B model on Microsoft Azure’s cloud GPU infrastructure using the Kubernetes AI Toolchain Operator (KAITO) and vLLM for high-performance inference.
Introduction
OpenAI’s GPT-OSS-20B is a powerful open-weight LLM. Deploying such a model for real-time inference at scale requires modern GPU hardware and efficient orchestration. Azure’s Standard_NV36ads_A10_v5 GPU instances, together with AKS and KAITO, streamline resource provisioning, node management, and AI workload deployment.
Prerequisites
- Active Azure subscription with permission to create resource groups and clusters
- Approved GPU quota for the NVads A10 v5 series (e.g., Standard_NV36ads_A10_v5) or a comparable GPU VM family
- Familiarity with Kubernetes, kubectl, and Azure CLI
Step 1: Set Up Environment Variables
Define shell variables for resource uniqueness and region selection:
export RANDOM_ID="33000"
export REGION="swedencentral"
export AZURE_RESOURCE_GROUP="myKaitoResourceGroup$RANDOM_ID"
export CLUSTER_NAME="myClusterName$RANDOM_ID"
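If you prefer a fresh suffix on every run instead of the fixed value above, you can generate one; a minimal sketch, assuming openssl is available in your shell:
# Generate a random 6-character hex suffix so resource names stay unique
export RANDOM_ID="$(openssl rand -hex 3)"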
Step 2: Create Azure Resource Group
az group create \
--name $AZURE_RESOURCE_GROUP \
--location $REGION
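Optionally confirm the resource group was provisioned before moving on (the query assumes the default JSON shape returned by az group show):
# Should print "Succeeded" once the group exists
az group show --name $AZURE_RESOURCE_GROUP --query properties.provisioningState -o tsv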
Step 3: Enable AKS Preview Features
Install (or update) the aks-preview Azure CLI extension and register the AI Toolchain Operator preview feature:
az extension add --name aks-preview
az extension update --name aks-preview
az feature register \
--namespace "Microsoft.ContainerService" \
--name "AIToolchainOperatorPreview"
Check the registration status with az feature list or az feature show; the feature must reach the Registered state before you create the cluster.
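One way to poll the status and then refresh the resource provider once the feature reports Registered, using standard Azure CLI commands:
# Poll until the state changes from "Registering" to "Registered"
az feature show \
  --namespace "Microsoft.ContainerService" \
  --name "AIToolchainOperatorPreview" \
  --query properties.state -o tsv

# Propagate the newly registered feature to the resource provider
az provider register --namespace Microsoft.ContainerService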
Step 4: Create AKS Cluster With AI Toolchain Operator
az aks create \
--location $REGION \
--resource-group $AZURE_RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-count 1 \
--enable-ai-toolchain-operator \
--enable-oidc-issuer \
--generate-ssh-keys
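Cluster creation usually takes a few minutes. A quick check that it finished, assuming the variables defined earlier:
# Should print "Succeeded" when the cluster is ready
az aks show \
  --resource-group $AZURE_RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --query provisioningState -o tsv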
Step 5: Connect kubectl to Cluster
az aks get-credentials \
--resource-group ${AZURE_RESOURCE_GROUP} \
--name ${CLUSTER_NAME}
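Before creating a workspace, it is worth confirming that the AI Toolchain Operator components are running. Exact pod names can vary by add-on version, so a loose filter is used here:
# KAITO components (workspace controller and GPU provisioner) run in kube-system
kubectl get pods -n kube-system | grep -i kaito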
Step 6: Create KAITO Workspace YAML
Prepare a KAITO workspace specification file (workspace-gptoss.yaml):
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-gpt-oss-vllm-nv-a10
resource:
  instanceType: "Standard_NV36ads_A10_v5"
  count: 1
  labelSelector:
    matchLabels:
      app: gpt-oss-20b-vllm
inference:
  template:
    spec:
      containers:
        - name: vllm-openai
          image: vllm/vllm-openai:gptoss
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - openai/gpt-oss-20b
            - --swap-space
            - "4"
            - --gpu-memory-utilization
            - "0.85"
            - --port
            - "5000"
          ports:
            - name: http
              containerPort: 5000
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "36"
              memory: "440Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "18"
              memory: "220Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 600
            periodSeconds: 20
          env:
            - name: VLLM_ATTENTION_BACKEND
              value: "TRITON_ATTN_VLLM_V1"
            - name: VLLM_DISABLE_SINKS
              value: "1"
Apply it:
kubectl apply -f workspace-gptoss.yaml
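Provisioning the GPU node and pulling the vLLM image can take a while. A couple of commands to watch progress (a sketch; the deployment created by KAITO carries the same name as the workspace, which is also what Step 7 exposes):
# Watch the workspace until the resource and inference report ready
kubectl get workspace workspace-gpt-oss-vllm-nv-a10 -w

# Follow the vLLM server logs once the deployment exists
kubectl logs deployment/workspace-gpt-oss-vllm-nv-a10 -f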
Step 7: Expose the Inference Service Publicly
Expose the deployment via LoadBalancer:
kubectl expose deployment workspace-gpt-oss-vllm-nv-a10 \
--type=LoadBalancer \
--name=workspace-gpt-oss-vllm-nv-a10-pub
kubectl get svc workspace-gpt-oss-vllm-nv-a10-pub
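If you prefer to capture the external IP programmatically rather than copying it from the output (it can take a minute for Azure to assign it), a jsonpath query works and sets the CLUSTERIP variable used in the next step:
# Grab the public IP assigned by the Azure load balancer
export CLUSTERIP=$(kubectl get svc workspace-gpt-oss-vllm-nv-a10-pub \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $CLUSTERIP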
Step 8: Test the Endpoint
Use curl or the OpenAI Python SDK to send OpenAI-compatible API requests:
Shell Test:
export CLUSTERIP=<public_ip>
kubectl run -it --rm --restart=Never curl \
--image=curlimages/curl -- \
curl -X POST http://$CLUSTERIP:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "What is Kubernetes?"}], "max_tokens": 50, "temperature": 0}'
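As an additional sanity check, the vLLM server also exposes the standard OpenAI-compatible /v1/models endpoint, which should list openai/gpt-oss-20b (same assumptions as above: CLUSTERIP is set and the service listens on port 5000):
# List the models served by the vLLM endpoint
kubectl run -it --rm --restart=Never curl-models \
  --image=curlimages/curl -- \
  curl -s http://$CLUSTERIP:5000/v1/models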
Python SDK:
from openai import OpenAI
client = OpenAI(base_url="http://<ip>:5000/v1/", api_key="EMPTY")
result = client.chat.completions.create(
model="openai/gpt-oss-20b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain what MXFP4 quantization is."}
]
)
print(result.choices[0].message)
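The same endpoint also supports streaming responses, which is handy for eyeballing time-to-first-token interactively. A curl sketch, assuming the service is reachable on port 5000 as above:
# -N disables output buffering so tokens appear as they stream in
curl -N -X POST http://$CLUSTERIP:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Say hello"}], "stream": true, "max_tokens": 20}'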
Step 9: Load Testing & Performance Benchmarking
The llm-load-test-azure tool can stress-test inference endpoints. Example settings:
- Input Tokens per Request: 250
- Output Tokens per Request: ~1,500
- Test Duration: ~10 minutes
- Concurrency: 10
Key Metrics:

| Metric | Value | Description |
|---|---|---|
| TT_ACK | 0.60 s | Time to acknowledge the request |
| TTFT | 0.77 s | Time to first token |
| ITL | 0.038 s | Inter-token latency |
| TPOT | 0.039 s | Time per output token |
| Avg. Response Time | 57.77 s | Total latency per request |
| Output Tokens Throughput | 257.77 tokens/s | Average output tokens per second |
Concurrency Tradeoff:
- Lowering concurrency reduces per-request latency but may decrease throughput.
- Example: dropping concurrency from 10 to 5 reduced the average response time by ~21%, while running at higher concurrency increases aggregate token throughput.
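If you only need a rough concurrency smoke test before setting up llm-load-test-azure, a plain shell loop can approximate the scenario above (a minimal sketch, not a substitute for the tool's detailed latency and throughput reporting; assumes CLUSTERIP is still set):
# Fire 50 requests, 10 at a time, against the chat completions endpoint
seq 50 | xargs -P 10 -I{} curl -s -o /dev/null -w "request {} -> %{time_total}s\n" \
  -X POST http://$CLUSTERIP:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Summarize Kubernetes in three sentences."}], "max_tokens": 200}'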
Conclusion
You have successfully:
- Provisioned an AKS cluster with GPU support
- Installed KAITO and deployed GPT-OSS-20B with vLLM
- Set up a public inference endpoint
- Validated with API testing and performance load tests
Scale your setup by adjusting GPU count, vLLM configuration, or integrating with production applications. This architecture gives you a flexible, open-source, cloud-native LLM deployment on Azure for advanced inference workloads.
Special thanks to Andrew Thomas, Kurt Niebuhr, and Sachi Desai for support during this deployment.
Author: maljazaery — Updated August 15, 2025
This post appeared first on “Microsoft Tech Community”.