Content by Alberto Martinez (1)
Alberto Martinez explains how to reduce cost and latency in AI coding assistants by routing simple tasks (like docstrings) to smaller on-device models on Snapdragon NPUs, while reserving larger cloud models for complex work. The session outlines a three-tier routing architecture, quantization trade-offs, and a deployable classifier approach.
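As a rough illustration of the routing idea, the sketch below sends a request to one of three tiers based on task type and context size. The abstract does not say what the three tiers are or how the classifier works, so everything here is an assumption made for illustration: the tier names, the task labels, the token thresholds, and the use of fixed rules in place of a trained classifier.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Tier names are hypothetical; the session's actual tiers are not given.
    ON_DEVICE = "on_device"  # small quantized model running on the NPU
    MID = "mid"              # assumed middle tier between device and cloud
    CLOUD = "cloud"          # large cloud-hosted model


@dataclass(frozen=True)
class Request:
    task: str            # coarse task label, e.g. "docstring"
    context_tokens: int  # size of the prompt plus attached code context


# Hypothetical task labels and thresholds, chosen only for the example.
SIMPLE_TASKS = frozenset({"docstring", "comment", "rename"})
COMPLEX_TASKS = frozenset({"multi_file_refactor", "architecture_design"})
ON_DEVICE_MAX_TOKENS = 2_000
MID_MAX_TOKENS = 16_000


def route(req: Request) -> Tier:
    """Pick a tier with fixed rules (a stand-in for a learned classifier)."""
    # Simple task with a small context: keep it on the device.
    if req.task in SIMPLE_TASKS and req.context_tokens <= ON_DEVICE_MAX_TOKENS:
        return Tier.ON_DEVICE
    # Complex task or very large context: send it to the cloud model.
    if req.task in COMPLEX_TASKS or req.context_tokens > MID_MAX_TOKENS:
        return Tier.CLOUD
    # Everything else falls through to the middle tier.
    return Tier.MID


if __name__ == "__main__":
    for r in (
        Request("docstring", 300),
        Request("unit_test", 5_000),
        Request("multi_file_refactor", 300),
    ):
        print(r.task, "->", route(r).value)
```

In a real deployment the rule set would presumably be replaced by the classifier the session describes, with thresholds tuned against measured cost, latency, and quality.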