Evaluating AI Agent Output with GitHub Copilot and AI Toolkit (Pet Planner Workshop, Part 6)
April from Microsoft Developer presents practical techniques for evaluating AI agent output using GitHub Copilot and AI Toolkit, offering step-by-step insights for developers in this Pet Planner workshop segment.
Presenter: April (Microsoft Developer)
This video is the sixth part of a workshop series combining AI Toolkit and GitHub Copilot, focusing specifically on how developers can evaluate the output of AI agents within a real-world scenario: the Pet Planner application.
Workshop Overview
- Series Context: Part of the AI Toolkit + GitHub Copilot Pet Planner workshop
Agenda & Chapter Markers
- 00:00–00:02: Introduction
- 00:03–01:19: Workshop Progress Recap
- 01:20–02:57: Choosing Evaluators with Copilot
- 02:58–07:00: Creating a Dataset using Copilot
- 07:01–16:50: Reviewing Evaluation Plan and Creating an Evaluation Script
- 16:51–18:50: Reviewing Evaluation Output
- 18:51–22:41: Creating an Evaluation Report Using Copilot
Key Technical Steps Demonstrated
1. Preparing for Evaluation
- Sets the context for evaluating AI agent output within the Pet Planner app.
- Emphasis on integrating Copilot and AI Toolkit tools for a streamlined workflow.
2. Choosing Evaluators Using Copilot
- Selecting metrics and evaluators suited to measuring agent effectiveness.
- Using Copilot’s suggestions to pick the best-fit evaluators (a minimal sketch follows this item).
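The video does not publish its evaluator code, so as a stand-in, here is a plain-Python sketch of the selection step: build a small registry of scoring functions and pick the subset that matches your quality goals. The evaluator names and heuristics below are illustrative assumptions, not the workshop’s actual evaluators.
```python
from typing import Callable

# An evaluator maps (query, response) -> score in [0, 1].
Evaluator = Callable[[str, str], float]

def relevance(query: str, response: str) -> float:
    """Crude keyword-overlap proxy for relevance (illustrative only)."""
    q = set(query.lower().split())
    r = set(response.lower().split())
    return len(q & r) / max(len(q), 1)

def completeness(query: str, response: str) -> float:
    """Length-based proxy: very short answers are likely incomplete."""
    return min(len(response.split()) / 50, 1.0)

# Registry of available evaluators; pick the subset that fits the scenario.
EVALUATORS: dict[str, Evaluator] = {
    "relevance": relevance,
    "completeness": completeness,
}

# For a planning agent like Pet Planner, relevance and completeness are
# plausible first picks (a judgment call, not the video's exact choices).
selected = {name: EVALUATORS[name] for name in ("relevance", "completeness")}
```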
3. Dataset Creation
- Demonstrates how Copilot can assist with generating and curating datasets for evaluation (sketched below).
- Tips on setting up sample data for realistic agent assessment scenarios.
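Copilot-generated evaluation data typically lands in JSONL, one test case per line. The following sketch shows what such a dataset might look like for Pet Planner; the file name and the `query`/`expected` field names are assumed, not taken from the video.
```python
import json

# Hypothetical Pet Planner test cases; the field names are an assumed schema.
cases = [
    {"query": "Plan a daily feeding schedule for a 2-year-old beagle.",
     "expected": "A schedule with morning and evening meals and portion sizes."},
    {"query": "What vaccinations does a kitten need in its first year?",
     "expected": "Core vaccines such as FVRCP and rabies, with a rough timeline."},
]

# One JSON object per line: the usual JSONL convention for evaluation data.
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")
```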
4. Building the Evaluation Script
- April shows how to assemble scripts for automated agent-output evaluation using Copilot in Agent mode together with AI Toolkit’s evaluation features (see the sketch below).
- Covers script customizations and practical considerations for robust evaluation.
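The actual script in the video is assembled by Copilot in Agent mode with AI Toolkit; as a rough stand-in, here is a minimal evaluation loop over the JSONL dataset from the previous step. `run_agent` is a hypothetical placeholder for whatever call invokes the Pet Planner agent, and the relevance heuristic is the same illustrative proxy as above.
```python
import json
from statistics import mean

def run_agent(query: str) -> str:
    """Placeholder for the real agent call (e.g., an HTTP request to the app)."""
    return f"Stub response for: {query}"

def relevance(query: str, response: str) -> float:
    """Keyword-overlap proxy for relevance; an assumption, not the real metric."""
    q = set(query.lower().split())
    r = set(response.lower().split())
    return len(q & r) / max(len(q), 1)

rows = []
with open("eval_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        case = json.loads(line)
        response = run_agent(case["query"])
        rows.append({
            "query": case["query"],
            "response": response,
            "relevance": relevance(case["query"], response),
        })

# Aggregate a per-metric average and persist everything for the report step.
summary = {"mean_relevance": mean(row["relevance"] for row in rows)}
with open("eval_results.json", "w", encoding="utf-8") as f:
    json.dump({"rows": rows, "summary": summary}, f, indent=2)
```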
5. Reviewing and Reporting Results
- Steps for reviewing evaluation results, troubleshooting issues, and analyzing key findings.
- Generating comprehensive evaluation reports with actionable recommendations using Copilot (a minimal report sketch follows).
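Report generation can be as simple as turning the results file into Markdown with pass/fail flags against a threshold, then asking Copilot to draft recommendations from it. A minimal sketch; the 0.5 threshold and the file names are assumptions.
```python
import json

THRESHOLD = 0.5  # assumed pass bar; tune per metric

with open("eval_results.json", encoding="utf-8") as f:
    results = json.load(f)

# Build a Markdown table with one row per test case.
lines = [
    "# Pet Planner Evaluation Report",
    "",
    "| Query | Relevance | Pass |",
    "|---|---|---|",
]
for row in results["rows"]:
    passed = "yes" if row["relevance"] >= THRESHOLD else "no"
    lines.append(f"| {row['query'][:40]} | {row['relevance']:.2f} | {passed} |")

lines += ["", f"**Mean relevance:** {results['summary']['mean_relevance']:.2f}"]

with open("eval_report.md", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))
```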
Technologies & Tools Highlighted
- GitHub Copilot (Agent Mode): Used for scripting and streamlining the workflow.
- Microsoft AI Toolkit: Assists in model evaluation setup and management.
- Microsoft Foundry: Platform for managing the project and its evaluations.
- Azure: Underlying cloud infrastructure enabling these workflows.
Summary
This session provides actionable, step-by-step instruction on evaluating the performance of AI agents, leveraging GitHub Copilot, AI Toolkit, and Azure-powered infrastructure. The workshop is designed for developers looking to build, test, and iterate on real-world AI applications.