Cary_Chai provides a clear walkthrough of deploying OpenAI’s gpt-oss-120b and gpt-oss-20b models on Azure Container Apps serverless GPUs via Ollama, outlining deployment steps, technical considerations, and tips for running scalable AI workloads in the cloud.

Deploying OpenAI’s gpt-oss Models to Azure Container Apps Serverless GPUs with Ollama

Author: Cary_Chai

OpenAI recently introduced the open-weight gpt-oss-120b and gpt-oss-20b models, making it easier for developers to self-host powerful language models. This guide demonstrates how to deploy these models using Ollama containers on Azure Container Apps serverless GPUs, taking advantage of the platform’s autoscaling, managed identity, and enterprise features.

Why Use Azure Container Apps Serverless GPUs?

Azure Container Apps offers a serverless environment for running containerized applications. With serverless GPUs (NVIDIA A100 or T4), developers can run AI models at scale without managing infrastructure. Key benefits include:

  • Autoscaling: Scales to zero when idle, scales out automatically
  • Pay-per-second billing: Cost optimization by paying only for actual usage
  • Ease of Deployment: Simple container onboarding, managed networking, and identity
  • Enterprise Features: VNET, private endpoints, compliance, and data governance

Choosing the Right gpt-oss Model

  • gpt-oss-120b: Powerful, designed for high-performance inference, with quality comparable to OpenAI o4-mini. Requires A100 GPUs.
  • gpt-oss-20b: Lightweight, ideal for smaller workloads, similar to OpenAI o3-mini. Runs efficiently on T4 GPUs (or A100).
Serverless GPU regions at the time of writing include West US, West US 3, Sweden Central, Australia East, and West Europe; A100 and T4 availability varies by region, so confirm current support in the Azure documentation before choosing a region.

Step-by-Step Deployment Guide

1. Create Azure Container App

  • Navigate to the Azure Portal
  • Create a new resource > Search for ‘Azure Container Apps’ > Create
  • On the Basics tab, select a region where the GPU type you need (A100 or T4) is available

2. Configure the Container

  • Image Source: Docker Hub (or another registry)
  • Image Name: ollama/ollama:latest
  • Workload Profile: Consumption
  • GPU: Enable GPU, select A100 for 120b or T4/A100 for 20b
  • If GPU quota is insufficient, request a quota increase through the Azure Portal

3. Configure Ingress

  • Ingress Enabled: Yes
  • Traffic: Accept from anywhere
  • Target Port: 11434

4. Review and Create

  • Click ‘Review + Create’, then ‘Create’ to deploy the resource.
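
If you prefer scripting the same setup, the portal steps above roughly correspond to the Azure CLI commands below. This is a minimal sketch: the resource group, environment, and app names, the region, and the serverless GPU workload profile type (Consumption-GPU-NC24-A100 for A100, or Consumption-GPU-NC8as-T4 for T4) are assumptions to adjust for your subscription and quota.

    # Resource group and Container Apps environment (placeholder names and region)
    az group create --name gpt-oss-rg --location westus3

    az containerapp env create \
      --name gpt-oss-env \
      --resource-group gpt-oss-rg \
      --location westus3

    # Add a serverless GPU workload profile (A100 shown; use
    # Consumption-GPU-NC8as-T4 instead for T4-backed gpt-oss-20b deployments)
    az containerapp env workload-profile add \
      --name gpt-oss-env \
      --resource-group gpt-oss-rg \
      --workload-profile-name gpu \
      --workload-profile-type Consumption-GPU-NC24-A100

    # Deploy the Ollama container with external ingress on port 11434
    az containerapp create \
      --name ollama-gpt-oss \
      --resource-group gpt-oss-rg \
      --environment gpt-oss-env \
      --image ollama/ollama:latest \
      --ingress external \
      --target-port 11434 \
      --workload-profile-name gpu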

Running the gpt-oss Model on Azure Container Apps

  1. Open the deployed resource and use the Application URL to reach your container app.
  2. For hands-on interaction, use the Azure Portal’s Monitoring > Console:
    • Start Ollama:

      ollama serve
      
    • Pull the desired model:

      ollama pull gpt-oss:120b
      # or for 20b:
      ollama pull gpt-oss:20b
      
    • Run the model:

      ollama run gpt-oss:120b
      
    • Enter prompts to interact with the LLM.

  3. (Optional) To keep containers running for extended sessions, adjust replica or cooldown settings under ‘Scaling’.
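
Because serverless GPU apps scale to zero by default, the first request after an idle period pays a cold-start and model-load penalty. A minimal sketch of keeping one replica warm with the Azure CLI is shown below; the app and resource group names are placeholders, and the same setting can be changed under ‘Scaling’ in the portal.

    # Keep at least one replica running so the model stays loaded (placeholder names)
    az containerapp update \
      --name ollama-gpt-oss \
      --resource-group gpt-oss-rg \
      --min-replicas 1 \
      --max-replicas 1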

Calling the Ollama gpt-oss API from Your Local Machine

  1. Set the OLLAMA_URL to your Application URL:

    export OLLAMA_URL="{Your application URL}"
    
  2. Make an API call using curl:

    curl -X POST "$OLLAMA_URL/api/generate" -H "Content-Type: application/json" \
      -d '{"model":"gpt-oss:120b", "prompt":"Can you explain LLMs and recent developments in AI the last few years?", "stream":false}'
    
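
Beyond /api/generate, Ollama also exposes a chat-style endpoint that accepts a list of role-tagged messages, which is usually more convenient when integrating with applications. A minimal example against the same deployment (the model name and prompt are illustrative):

    curl -X POST "$OLLAMA_URL/api/chat" -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-oss:20b",
        "messages": [
          {"role": "user", "content": "Summarize what serverless GPUs are in one sentence."}
        ],
        "stream": false
      }'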

Persisting State and Data

The file system of an Azure Container Apps replica is ephemeral. To persist data (e.g., pulled models or logs), add a volume mount following the official documentation.
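
A hedged sketch of that flow with Azure Files is below: register a file share on the environment, then mount it at /root/.ollama (Ollama’s default model directory) so pulled models survive restarts. The storage account, share, environment, and app names are placeholders, and the YAML snippet follows the volume-mount pattern described in the Container Apps documentation.

    # Attach an Azure Files share to the Container Apps environment (placeholder names)
    az containerapp env storage set \
      --name gpt-oss-env \
      --resource-group gpt-oss-rg \
      --storage-name ollama-models \
      --azure-file-account-name <storage-account> \
      --azure-file-account-key <storage-key> \
      --azure-file-share-name ollama-models \
      --access-mode ReadWrite

    # Export the app definition, add the volume and volumeMount, then re-apply
    az containerapp show --name ollama-gpt-oss --resource-group gpt-oss-rg -o yaml > app.yaml
    # In app.yaml, under template: add
    #   volumes:
    #     - name: ollama-models
    #       storageType: AzureFile
    #       storageName: ollama-models
    # and under the container: add
    #   volumeMounts:
    #     - volumeName: ollama-models
    #       mountPath: /root/.ollama
    az containerapp update --name ollama-gpt-oss --resource-group gpt-oss-rg --yaml app.yaml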

Summary

By following this guide, you can self-host high-performance language models like gpt-oss-120b and gpt-oss-20b on Azure’s fully managed, scalable, and cost-efficient serverless GPU infrastructure. The process covers best practices for deployment, operation, and integration for advanced AI workloads.

This post appeared first on “Microsoft Tech Community”.