How we ship models in VS Code | LIVE161
Julia Kasper and Seth Juarez share how the VS Code and Copilot teams approach selecting, testing, and rolling out AI models for different tasks.
Overview
The session explains why shipping “the right model for the right task” is hard in practice, and how the team uses structured evaluation to decide when a model change is safe to roll out.
Key themes covered:
- Model selection complexity: why different model families can behave differently even when given the same prompt, and how that affects product quality.
- Comparing model behavior: evaluating multiple models side by side on identical prompts to understand differences in output quality and failure modes.
- An “AI harness” approach: a harness that supports testing, evaluating, and debugging model behavior.
- Iteration and optimization: collaborative improvement loops that include prompt tuning and ongoing refinement.
- Benchmarks and evaluation process: using benchmarks and repeated evaluation cycles to guide decisions about updates.
- Balancing capability with reliability: deciding when to roll out updates while minimizing regressions and maintaining user trust.
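The themes above (identical prompts across models, benchmark-style checks, and a rollout decision that guards against regressions) can be illustrated with a minimal sketch. This is not the VS Code or Copilot team's harness. The `Case` type, the toy models, and the `safe_to_roll_out` rule are all hypothetical, chosen only to show the shape of a side-by-side comparison.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "model" is any function from prompt text to output text.
Model = Callable[[str], str]


@dataclass
class Case:
    """One benchmark case: a prompt plus a check on the model's output."""
    prompt: str
    check: Callable[[str], bool]  # True when the output is acceptable


def evaluate(models: Dict[str, Model], cases: List[Case]) -> Dict[str, List[bool]]:
    """Run every model on the identical prompts and record pass/fail per case."""
    return {
        name: [case.check(model(case.prompt)) for case in cases]
        for name, model in models.items()
    }


def regressions(baseline: List[bool], candidate: List[bool]) -> List[int]:
    """Indices of cases the baseline passes but the candidate fails."""
    return [i for i, (b, c) in enumerate(zip(baseline, candidate)) if b and not c]


def safe_to_roll_out(baseline: List[bool], candidate: List[bool]) -> bool:
    """Hypothetical gate: no regressions and an overall pass count that is not lower."""
    return not regressions(baseline, candidate) and sum(candidate) >= sum(baseline)


# Toy stand-ins for two model versions (hypothetical, for illustration only).
def current_model(prompt: str) -> str:
    return prompt.upper()


def candidate_model(prompt: str) -> str:
    # Behaves differently on some prompts, even though the prompt is the same.
    return prompt.upper() if "hello" in prompt else prompt


cases = [
    Case("hello", lambda out: out == "HELLO"),
    Case("world", lambda out: out == "WORLD"),
]

results = evaluate({"current": current_model, "candidate": candidate_model}, cases)
failed = regressions(results["current"], results["candidate"])
decision = safe_to_roll_out(results["current"], results["candidate"])
```

Here the candidate fails the second case that the current model passes, so `failed` is `[1]` and `decision` is `False`. A real gate would likely use richer scoring than pass/fail and tolerate some noise across repeated runs, but the structure (same prompts, per-case comparison, explicit rollout rule) is the same.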
Resources
- VS Code GitHub repo: https://aka.ms/VSCode/GHRepo
- DB view resource: https://aka.ms/VSCode/DBview
- Harness blog: https://aka.ms/VSCode/HarnessBlog