How we ship models in VS Code | LIVE161

Uploaded: 2026-06-05T13:37:30+00:00
Description: Julia Kasper and Seth Juarez give an inside look at how the VS Code and Copilot teams evaluate and ship AI model updates, including how they test model...

By Julia Kasper and Seth Juarez

Julia Kasper and Seth Juarez share how the VS Code and Copilot teams approach selecting, testing, and rolling out AI models for different tasks.

Overview

The session explains why shipping “the right model for the right task” is hard in practice, and how the team uses structured evaluation to decide when a model change is safe to roll out.

Key themes covered:

Model selection complexity: why different model families can behave differently even when given the same prompt, and how that affects product quality.
Comparing model behavior: evaluating multiple models side-by-side using identical prompts to understand differences in output quality and failure modes.
An “AI harness” approach: using a harness to support testing, evaluation, and debugging of model behavior.
Iteration and optimization: collaborative improvement loops that include prompt tuning and ongoing refinement.
Benchmarks and evaluation process: using benchmarks and repeated evaluation cycles to guide decisions about updates.
Balancing capability with reliability: deciding when to roll out updates while minimizing regressions and maintaining user trust.
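
To make the side-by-side comparison and rollout themes above more concrete, here is a minimal sketch in Python. It is not the harness the VS Code and Copilot teams use; every name in it (model_a, model_b, Case, pass_rates, safe_to_roll_out) is hypothetical, and the two "models" are trivial stub functions standing in for real model calls. It only illustrates the general shape: run identical prompts through each model, score the outputs with simple checks, and gate a rollout on the candidate not regressing against the baseline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical stand-ins for real model calls (assumption: a real harness
# would call actual model endpoints here instead of these stubs).
def model_a(prompt: str) -> str:
    return prompt.upper()


def model_b(prompt: str) -> str:
    return prompt[::-1]


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def pass_rates(models: Dict[str, Callable[[str], str]],
               cases: List[Case]) -> Dict[str, float]:
    """Run every model on the same prompts; return the fraction of checks passed."""
    rates = {}
    for name, model in models.items():
        passed = sum(1 for case in cases if case.check(model(case.prompt)))
        rates[name] = passed / len(cases)
    return rates


def safe_to_roll_out(baseline: float, candidate: float,
                     max_regression: float = 0.0) -> bool:
    """Gate: the candidate may not score below the baseline by more than max_regression."""
    return candidate >= baseline - max_regression


# Toy evaluation set (illustrative only, not a real benchmark).
CASES = [
    Case("hello", lambda out: out == "HELLO"),
    Case("abc", lambda out: "A" in out),
    Case("level", lambda out: out.lower() == "level"),
]

rates = pass_rates({"model_a": model_a, "model_b": model_b}, CASES)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")

# Treat model_a as the shipped baseline and model_b as the candidate update.
print("roll out model_b:", safe_to_roll_out(rates["model_a"], rates["model_b"]))
```

In this toy setup the candidate fails two of the three checks, so the gate blocks it. A real evaluation would use far larger prompt sets, repeated runs, and richer scoring than a single pass/fail check per prompt.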

Resources

VS Code GitHub repo: https://aka.ms/VSCode/GHRepo
DB view resource: https://aka.ms/VSCode/DBview
Harness blog: https://aka.ms/VSCode/HarnessBlog