Stop routing docstrings to 70B models with on-device AI on Snapdragon | BRKSP90
Alberto Martinez presents a Microsoft Build 2026 breakout on designing an inference-routing strategy for AI coding assistants so that simple coding tasks don’t automatically hit expensive, high-latency 70B+ cloud models.
Overview
The session focuses on a three-tier inference routing architecture that prioritizes keeping code local and reducing cloud spend:
- On-device tier (≤13B models) for low-complexity tasks (example given: generating docstrings)
- On-prem tier (14B–34B models) for medium-complexity tasks
- Cloud tier (70B+ models) for the hardest tasks
The stated goals and outcomes include:
- Cutting cloud tokens by 67%
- Reducing latency by 70%
- Keeping most code and requests local by default
Architecture: three-tier inference routing
The core idea is to route requests based on task complexity rather than sending everything to the largest model.
Key elements called out in the session description:
- A routing logic layer that decides which tier/model to use
- A classifier that determines task complexity and selects the appropriate tier
- A fallback mechanism (referred to in the chapter list as a “fifth ‘fallback’ classifier mechanism”)
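The three elements above (routing layer, complexity classifier, fallback) can be sketched as follows. This is a minimal illustration, not the session's implementation: the keyword lists, the confidence scores, the `min_confidence` threshold, and the choice to fall back to the cloud tier are all hypothetical assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Tier(Enum):
    ON_DEVICE = "on-device (<=13B)"
    ON_PREM = "on-prem (14B-34B)"
    CLOUD = "cloud (70B+)"


@dataclass
class Classification:
    tier: Tier
    confidence: float  # 0.0 to 1.0


# Hypothetical keyword lists; a real classifier would be learned or tuned.
LOW_COMPLEXITY = ("docstring", "rename", "format")
MEDIUM_COMPLEXITY = ("unit test", "bug", "explain")
HIGH_COMPLEXITY = ("refactor", "architecture", "migrate")


def keyword_classifier(prompt: str) -> Classification:
    """Toy complexity classifier: maps a prompt to a tier plus a confidence."""
    text = prompt.lower()
    if any(k in text for k in HIGH_COMPLEXITY):
        return Classification(Tier.CLOUD, 0.8)
    if any(k in text for k in MEDIUM_COMPLEXITY):
        return Classification(Tier.ON_PREM, 0.7)
    if any(k in text for k in LOW_COMPLEXITY):
        return Classification(Tier.ON_DEVICE, 0.9)
    # Nothing matched: guess the middle tier, but with low confidence.
    return Classification(Tier.ON_PREM, 0.3)


def route(
    prompt: str,
    classifier: Callable[[str], Classification] = keyword_classifier,
    min_confidence: float = 0.6,
    fallback_tier: Tier = Tier.CLOUD,
) -> Tier:
    """Routing layer: trust the classifier unless its confidence is too low."""
    result = classifier(prompt)
    if result.confidence < min_confidence:
        # Assumed fallback policy: escalate uncertain requests to a fixed tier.
        return fallback_tier
    return result.tier


if __name__ == "__main__":
    for p in ("Write a docstring for this function",
              "Fix the bug in this parser",
              "Refactor the billing module",
              "hello"):
        print(f"{p!r} -> {route(p).value}")
```

Passing the classifier in as a parameter keeps the routing layer independent of how complexity is estimated, so a keyword heuristic can later be swapped for a small model without changing `route`.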
Quantization and model trade-offs
The talk covers quantization trade-offs, framing quantization as part of what makes smaller models viable for on-device inference (especially when targeting an NPU).
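One side of the trade-off is simple arithmetic: weight storage scales with parameter count times bits per weight. The sketch below estimates weight memory only, as an illustration rather than a figure from the session. It ignores activations, KV cache, and quantization metadata, and says nothing about the quality cost of lower precision, which has to be measured per model. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    bytes = parameters * bits_per_weight / 8
    """
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # A 13B model, the upper bound of the on-device tier described above.
    for bits in (16, 8, 4):
        print(f"13B at {bits}-bit: {weight_memory_gb(13, bits):.1f} GB")
    # 13B at 16-bit: 26.0 GB
    # 13B at 8-bit: 13.0 GB
    # 13B at 4-bit: 6.5 GB
```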
Measurement and optimization loop
The session highlights a three-step optimization framework:
- Measure token cost
- Measure latency
- Iterate
This positions routing as an ongoing tuning exercise rather than a one-time configuration.
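A minimal sketch of the measuring half of that loop is shown below, assuming each request is logged with its tier, token count, and latency. The `RequestLog` shape and the sample numbers are invented for illustration and are not the session's data.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class RequestLog:
    tier: str          # "on-device", "on-prem", or "cloud"
    tokens: int        # tokens consumed by the request
    latency_ms: float  # end-to-end latency


def cloud_token_reduction(logs: List[RequestLog]) -> float:
    """Fraction of tokens kept off the cloud, relative to an all-cloud baseline."""
    total = sum(r.tokens for r in logs)
    cloud = sum(r.tokens for r in logs if r.tier == "cloud")
    return 1.0 - cloud / total if total else 0.0


def median_latency_by_tier(logs: List[RequestLog]) -> Dict[str, float]:
    """Median latency per tier, to spot which tier needs tuning next."""
    by_tier: Dict[str, List[float]] = {}
    for r in logs:
        by_tier.setdefault(r.tier, []).append(r.latency_ms)
    return {tier: median(values) for tier, values in by_tier.items()}


# Made-up sample logs, purely to exercise the functions.
SAMPLE_LOGS = [
    RequestLog("on-device", 200, 120.0),
    RequestLog("on-device", 300, 180.0),
    RequestLog("on-prem", 500, 400.0),
    RequestLog("cloud", 1000, 1500.0),
]

if __name__ == "__main__":
    print(f"cloud token reduction: {cloud_token_reduction(SAMPLE_LOGS):.0%}")
    print(median_latency_by_tier(SAMPLE_LOGS))
```

Re-running these two measurements after each change to the classifier or its thresholds is one way to carry out the "iterate" step.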
Session structure (from chapters)
- Quantitative breakdown of token usage in coding tasks
- Workload complexity distribution analysis (includes a reference to Claude Sonnet 4.6)
- Cost-savings discussion (including a claim of up to $24K in potential daily savings)
- Tiered architecture introduction and model complexity framing
- Optimization framework and Q&A on the fallback classifier mechanism