Stop routing docstrings to 70B models with on-device AI on Snapdragon | BRKSP90

Alberto Martinez presents a Microsoft Build 2026 breakout on designing an inference-routing strategy for AI coding assistants so that simple coding tasks don’t automatically hit expensive, high-latency 70B+ cloud models.

Overview

The session focuses on a three-tier inference routing architecture that prioritizes keeping code local and reducing cloud spend.

The stated goals and outcomes include:

Architecture: three-tier inference routing

The core idea is to route requests based on task complexity rather than sending everything to the largest model.
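To make complexity-based routing concrete, here is a minimal sketch in Python. It is not the speaker's implementation: the tier names, the task weights, the context-size penalty, and the two thresholds are all hypothetical values chosen for illustration. The session summary states only that there are three tiers and that routing depends on task complexity.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Tier names are assumptions; the source only says there are three tiers.
    ON_DEVICE = "on_device"      # e.g. a small quantized model on the NPU
    MID_TIER = "mid_tier"        # e.g. a mid-sized model
    CLOUD_LARGE = "cloud_large"  # e.g. a 70B+ cloud model


@dataclass
class Request:
    task: str            # e.g. "docstring", "refactor"
    context_tokens: int  # size of the code context sent with the request


# Hypothetical per-task base complexity in [0, 1].
TASK_WEIGHT = {
    "docstring": 0.1,
    "rename": 0.1,
    "unit_test": 0.4,
    "refactor": 0.6,
    "architecture": 0.9,
}


def complexity(req: Request) -> float:
    """Score a request in [0, 1]; unknown tasks get a middle score."""
    base = TASK_WEIGHT.get(req.task, 0.5)
    # Larger contexts add up to 0.3; the 8000-token cap is an assumption.
    size_penalty = min(req.context_tokens / 8000, 1.0) * 0.3
    return min(base + size_penalty, 1.0)


def route(req: Request, local_max: float = 0.3, mid_max: float = 0.7) -> Tier:
    """Pick the cheapest tier whose threshold covers the request's score."""
    score = complexity(req)
    if score <= local_max:
        return Tier.ON_DEVICE
    if score <= mid_max:
        return Tier.MID_TIER
    return Tier.CLOUD_LARGE
```

Under these assumed weights, `route(Request("docstring", 200))` stays on-device, while `route(Request("architecture", 6000))` goes to the large cloud model. A real router would likely use richer signals than a task label and a token count.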

Key elements called out in the session description:

Quantization and model trade-offs

The talk explicitly includes quantization trade-offs, framing quantization as part of making smaller models viable for on-device inference (especially when targeting an NPU).
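The trade-off can be illustrated with simple arithmetic: fewer bits per weight shrink the memory footprint but increase rounding error. The sketch below is a generic symmetric quantization example in pure Python, not the scheme used in the session. The function names and sample weights are hypothetical, and the memory estimate counts weights only (no activations or KV cache), with 1 GB taken as 10^9 bytes.

```python
def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Approximate weight storage in GB: parameters * bits / 8 bytes."""
    return params_billions * bits / 8


def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest round-trip error introduced by quantizing to `bits` bits."""
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Illustrative weights only; real tensors have millions of values.
sample = [0.5, -0.25, 0.1, -0.9, 0.33]
print(weight_memory_gb(70, 16), weight_memory_gb(7, 4))
print(max_abs_error(sample, 8), max_abs_error(sample, 4))
```

By this estimate a 70B model at 16 bits needs about 140 GB for weights alone, while a 7B model at 4 bits needs about 3.5 GB, which is the kind of gap that makes on-device inference plausible. The cost shows up as the larger round-trip error at 4 bits than at 8 bits.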

Measurement and optimization loop

A named optimization framework is highlighted:

This positions routing as an ongoing tuning exercise rather than a one-time configuration.
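As one generic illustration of what such a tuning loop could look like (this is not the framework named in the session), the sketch below re-derives the on-device routing threshold from logged outcomes. The `Sample` record, the acceptance signal, and the 90% target are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    score: float    # complexity score assigned to a past request
    local_ok: bool  # hypothetical signal: was the on-device answer accepted?


def pick_local_threshold(samples, candidates, min_accept_rate=0.9):
    """Return the highest candidate threshold whose on-device traffic
    still meets the target acceptance rate, or None if none qualifies."""
    best = None
    for threshold in sorted(candidates):
        routed = [s for s in samples if s.score <= threshold]
        if not routed:
            continue
        rate = sum(s.local_ok for s in routed) / len(routed)
        if rate >= min_accept_rate:
            best = threshold
    return best


# Made-up log entries, for illustration only.
log = [
    Sample(0.1, True), Sample(0.2, True), Sample(0.3, True),
    Sample(0.5, False), Sample(0.6, True), Sample(0.8, False),
]
print(pick_local_threshold(log, [0.2, 0.4, 0.6, 0.8]))
```

Re-running a selection like this on fresh logs is what turns routing into a recurring measurement task: the threshold moves as models, quantization settings, or workloads change.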

Session structure (from chapters)