Deploying GPT-OSS-20B as a Sidecar AI Model on Azure App Service
TulikaC demonstrates deploying the GPT-OSS-20B open-weight language model as a sidecar container with a Python Flask app on Azure App Service, offering developers a scalable and secure platform for AI-backed applications.
Author: TulikaC
Overview
OpenAI’s GPT-OSS-20B is an open, high-performing language model available under the Apache 2.0 license. This guide describes how to deploy it as a sidecar container with a Python Flask web application on Azure App Service, leveraging Azure’s managed capabilities for scale, security, and DevOps.
Key Benefits
- Efficient Model Hosting: Run GPT-OSS-20B with your own application instance, optimizing for cost, privacy, and control.
- Enterprise Azure Features: Use built-in autoscaling, robust security, VNet integration, and seamless CI/CD in Azure App Service.
- Developer-Friendly Deployment: Rely on containers, Bicep templates, and Azure Container Registry for maintainability.
Solution Architecture
- Application: Python Flask web app, served via Azure App Service (code-based).
- AI Model: GPT-OSS-20B, running as a sidecar container using Ollama within the same App Service instance.
- Communication: The Flask app sends inference requests to the sidecar model container at localhost:11434.
- Infrastructure as Code: Bicep scripts provision all Azure resources (Web App, ACR, networking, etc.).
[Architecture diagram omitted]
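For orientation, a minimal non-streaming call to the sidecar over localhost might look like the sketch below. The helper name is illustrative, and the HTTP request only fires when the script is run directly against a live sidecar:

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"

def build_chat_payload(prompt, model="gpt-oss:20b"):
    """Assemble an Ollama /api/chat request body for a single user turn."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON response instead of a stream
    }

if __name__ == "__main__":
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/chat",
        data=json.dumps(build_chat_payload("Hello!")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    print(reply["message"]["content"])
```

Because the model runs in the same App Service instance, no external endpoint or API key is involved; the call never leaves localhost.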
1. Containerizing GPT-OSS-20B
The model is pre-packaged in a Docker image (using Ollama as a runtime). Example Dockerfile:
FROM ollama/ollama
EXPOSE 11434
COPY startup.sh /
RUN chmod +x /startup.sh
ENTRYPOINT ["/startup.sh"]
The startup.sh script:
#!/bin/sh
# Start the model server in the background and give it time to come up
ollama serve &
sleep 5
# Download the gpt-oss:20b model
ollama pull gpt-oss:20b
# Restart Ollama in the foreground to serve the model
pkill -f "ollama"
ollama serve
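Once the container is running locally (for example via docker run with port 11434 published), you can confirm the model finished downloading by querying Ollama's /api/tags endpoint. A small sketch — the model_available helper is ours, and the HTTP call only runs when the script is executed directly:

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"

def model_available(tags_payload, name):
    """Return True if a model with the given name appears in an /api/tags payload."""
    return any(m.get("name") == name for m in tags_payload.get("models", []))

if __name__ == "__main__":
    # /api/tags lists the models the Ollama server currently has pulled
    with urllib.request.urlopen(f"{OLLAMA_HOST}/api/tags") as resp:
        payload = json.load(resp)
    print("gpt-oss:20b ready:", model_available(payload, "gpt-oss:20b"))
```

The same check is useful as a readiness probe after the first deployment, since the initial model download can take several minutes.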
2. Flask Application Integration
The app connects to the GPT-OSS-20B model running in the sidecar using local HTTP requests. Key code sample:
import json

import requests
from flask import Flask, Response, request

app = Flask(__name__)

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "gpt-oss:20b"

@app.route("/chat", methods=["POST"])
def chat():
    data = request.get_json()
    prompt = data.get("prompt", "")
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

    def generate():
        # Forward the sidecar's streamed response line by line
        with requests.post(f"{OLLAMA_HOST}/api/chat", json=payload, stream=True) as r:
            for line in r.iter_lines(decode_unicode=True):
                if line:
                    event = json.loads(line)
                    if "message" in event:
                        yield event["message"]["content"]

    return Response(generate(), mimetype="text/plain")
- Real-Time Streaming: The endpoint streams model responses as they’re generated, suitable for chat UIs.
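Ollama streams its chat response as newline-delimited JSON, one event per line. A minimal sketch of the parsing this streaming endpoint relies on — the sample lines below are illustrative, not real model output:

```python
import json

def collect_stream(lines):
    """Concatenate the content fields from Ollama-style NDJSON chat events."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip keep-alive blank lines
        event = json.loads(line)
        if "message" in event:
            parts.append(event["message"]["content"])
    return "".join(parts)

# Illustrative sample of the frames emitted while streaming
sample = [
    '{"message": {"role": "assistant", "content": "Hello"}}',
    '',
    '{"message": {"role": "assistant", "content": ", world"}}',
    '{"done": true}',
]
print(collect_stream(sample))  # → Hello, world
```

The final frame carries "done": true with no message content, which is why the generator checks for a "message" key before yielding.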
3. Azure Deployment Workflow
Resources are provisioned using Bicep templates:
- Azure Container Registry (ACR): Stores the Ollama/GPT-OSS-20B container image.
- Azure App Service (Premium V4): Hosts the Flask app and attaches the model container as a sidecar.
Deployment Steps
- Build and push the GPT-OSS-20B sidecar image to ACR.
- Deploy the Flask app via VS Code, CLI, GitHub Actions, or Bicep (see template).
  Example:
  azd init
  azd up
- In the Azure Portal, attach the sidecar image to your Web App using port 11434 (sidecar docs).
- On initial run, the sidecar downloads the model—subsequent restarts are faster.
Additional Azure Features
- Autoscaling to match demand
- VNet and security integration for compliance
- CI/CD support via GitHub Actions or Azure Pipelines
- Monitoring and observability via App Insights
Conclusion
Running GPT-OSS-20B as a sidecar on Azure App Service combines the speed and freedom of open-source models with the scalability, security, and ease of use of the Azure managed platform. This architecture is ideal for:
- Lightweight enterprise chatbots
- AI-powered feature prototypes
- Experimenting with self-hosted models, including future model swaps or domain-specific tuning
You can easily adapt this approach for other open AI models, container runtimes, or frameworks.
Resources & Next Steps
- Sample Repository (all Bicep/code)
- GPT-OSS announcement (OpenAI)
- Azure App Service Sidecars
- Azure App Service Premium V4
- Pushing Images to Azure Container Registry
- Advanced AI architectures: RAG on Azure
This post appeared first on “Microsoft Tech Community”.