Evaluating AI Agent Output with GitHub Copilot and AI Toolkit (Pet Planner Workshop, Part 6)
April from Microsoft Developer presents practical techniques for evaluating AI agent output using GitHub Copilot and AI Toolkit, offering step-by-step insights for developers in this Pet Planner workshop segment.
Presenter: April (Microsoft Developer)
This video is the sixth part of a workshop series combining AI Toolkit and GitHub Copilot, focusing specifically on how developers can evaluate the output of AI agents within a real-world scenario: the Pet Planner application.
Workshop Overview
- Series Context: Part of the AI Toolkit + GitHub Copilot Pet Planner workshop
Agenda & Chapter Markers
- 00:00–00:02: Introduction
- 00:03–01:19: Workshop Progress Recap
- 01:20–02:57: Choosing Evaluators with Copilot
- 02:58–07:00: Creating a Dataset using Copilot
- 07:01–16:50: Reviewing Evaluation Plan and Creating an Evaluation Script
- 16:51–18:50: Reviewing Evaluation Output
- 18:51–22:41: Creating an Evaluation Report Using Copilot
Key Technical Steps Demonstrated
1. Preparing for Evaluation
- Sets the context for evaluating AI agent output within the Pet Planner app.
- Emphasis on integrating Copilot and AI Toolkit tools for a streamlined workflow.
2. Choosing Evaluators Using Copilot
- Selecting metrics and evaluators suited to measuring agent effectiveness.
- Using Copilot’s suggestions to pick the best-fit evaluators (a minimal sketch follows this item).
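The video does not publish its evaluator code, so as a stand-in, here is a plain-Python sketch of the selection step: build a small registry of scoring functions and pick the subset that matches your quality goals. The evaluator names and heuristics below are illustrative assumptions, not the workshop’s actual evaluators.
```python
from typing import Callable

# An evaluator maps (query, response) -> score in [0, 1].
Evaluator = Callable[[str, str], float]

def relevance(query: str, response: str) -> float:
    """Crude keyword-overlap proxy for relevance (illustrative only)."""
    q = set(query.lower().split())
    r = set(response.lower().split())
    return len(q & r) / max(len(q), 1)

def completeness(query: str, response: str) -> float:
    """Length-based proxy: very short answers are likely incomplete."""
    return min(len(response.split()) / 50, 1.0)

# Registry of available evaluators; pick the subset that fits the scenario.
EVALUATORS: dict[str, Evaluator] = {
    "relevance": relevance,
    "completeness": completeness,
}

# For a planning agent like Pet Planner, relevance and completeness are
# plausible first picks (a judgment call, not the video's exact choices).
selected = {name: EVALUATORS[name] for name in ("relevance", "completeness")}
```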
3. Dataset Creation
- Demonstrates how Copilot can assist with generating and curating datasets for evaluation (sketched below).
- Tips on setting up sample data for realistic agent assessment scenarios.
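Copilot-generated evaluation data typically lands in JSONL, one test case per line. The following sketch shows what such a dataset might look like for Pet Planner; the file name and the `query`/`expected` field names are assumed, not taken from the video.
```python
import json

# Hypothetical Pet Planner test cases; the field names are an assumed schema.
cases = [
    {"query": "Plan a daily feeding schedule for a 2-year-old beagle.",
     "expected": "A schedule with morning and evening meals and portion sizes."},
    {"query": "What vaccinations does a kitten need in its first year?",
     "expected": "Core vaccines such as FVRCP and rabies, with a rough timeline."},
]

# One JSON object per line: the usual JSONL convention for evaluation data.
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")
```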
4. Building the Evaluation Script
- April shows how to assemble scripts for automated agent-output evaluation using Copilot in Agent mode together with AI Toolkit’s evaluation features (see the sketch below).
- Covers script customizations and practical considerations for robust evaluation.
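The actual script in the video is assembled by Copilot in Agent mode with AI Toolkit; as a rough stand-in, here is a minimal evaluation loop over the JSONL dataset from the previous step. `run_agent` is a hypothetical placeholder for whatever call invokes the Pet Planner agent, and the relevance heuristic is the same illustrative proxy as above.
```python
import json
from statistics import mean

def run_agent(query: str) -> str:
    """Placeholder for the real agent call (e.g., an HTTP request to the app)."""
    return f"Stub response for: {query}"

def relevance(query: str, response: str) -> float:
    """Keyword-overlap proxy for relevance; an assumption, not the real metric."""
    q = set(query.lower().split())
    r = set(response.lower().split())
    return len(q & r) / max(len(q), 1)

rows = []
with open("eval_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        case = json.loads(line)
        response = run_agent(case["query"])
        rows.append({
            "query": case["query"],
            "response": response,
            "relevance": relevance(case["query"], response),
        })

# Aggregate a per-metric average and persist everything for the report step.
summary = {"mean_relevance": mean(row["relevance"] for row in rows)}
with open("eval_results.json", "w", encoding="utf-8") as f:
    json.dump({"rows": rows, "summary": summary}, f, indent=2)
```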
5. Reviewing and Reporting Results
- Steps for reviewing evaluation results, troubleshooting issues, and analyzing key findings.
- Generating comprehensive evaluation reports with actionable recommendations using Copilot (a minimal report sketch follows).
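Report generation can be as simple as turning the results file into Markdown with pass/fail flags against a threshold, then asking Copilot to draft recommendations from it. A minimal sketch; the 0.5 threshold and the file names are assumptions.
```python
import json

THRESHOLD = 0.5  # assumed pass bar; tune per metric

with open("eval_results.json", encoding="utf-8") as f:
    results = json.load(f)

# Build a Markdown table with one row per test case.
lines = [
    "# Pet Planner Evaluation Report",
    "",
    "| Query | Relevance | Pass |",
    "|---|---|---|",
]
for row in results["rows"]:
    passed = "yes" if row["relevance"] >= THRESHOLD else "no"
    lines.append(f"| {row['query'][:40]} | {row['relevance']:.2f} | {passed} |")

lines += ["", f"**Mean relevance:** {results['summary']['mean_relevance']:.2f}"]

with open("eval_report.md", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))
```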
Technologies & Tools Highlighted
- GitHub Copilot (Agent Mode): Used for scripting and streamlining the workflow.
- Microsoft AI Toolkit: Assists in model evaluation setup and management.
- Microsoft Foundry: Platform for managing the project and its evaluations.
- Azure: Underlying cloud infrastructure enabling these workflows.
Summary
This session provides actionable, step-by-step instruction on evaluating the performance of AI agents, leveraging GitHub Copilot, AI Toolkit, and Azure-powered infrastructure. The workshop is designed for developers looking to build, test, and iterate on real-world AI applications.