April from Microsoft Developer presents practical techniques for evaluating AI agent output using GitHub Copilot and AI Toolkit, offering step-by-step insights for developers in this Pet Planner workshop segment.

Evaluating AI Agent Output with GitHub Copilot and AI Toolkit

Presenter: April (Microsoft Developer)

This video is part six of a workshop series combining AI Toolkit and GitHub Copilot, focusing on how developers can evaluate the output of AI agents within a real-world scenario: the Pet Planner application.

Workshop Overview

Agenda & Chapter Markers

  • 00:00–00:02: Introduction
  • 00:03–01:19: Workshop Progress Recap
  • 01:20–02:57: Choosing Evaluators with Copilot
  • 02:58–07:00: Creating a Dataset using Copilot
  • 07:01–16:50: Reviewing Evaluation Plan and Creating an Evaluation Script
  • 16:51–18:50: Reviewing Evaluation Output
  • 18:51–22:41: Creating an Evaluation Report Using Copilot

Key Technical Steps Demonstrated

1. Preparing for Evaluation

  • Overview of the context for evaluating AI agent output within the Pet Planner app.
  • Emphasis on integrating Copilot and AI Toolkit tools for a streamlined workflow.

2. Choosing Evaluators Using Copilot

  • Selecting suitable metrics and evaluators for measuring agent effectiveness.
  • Using Copilot’s suggestions to select the best-fit evaluators; a sketch of what the resulting evaluator setup can look like follows this list.
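
For illustration, the snippet below sketches an evaluator setup in Python using the azure-ai-evaluation package (the evaluation SDK associated with AI Toolkit and Foundry-based workflows). The evaluator classes are that package’s documented built-ins, but the placeholder endpoint, key, and deployment values, and the specific metric choices, are assumptions rather than the workshop’s exact selections.

    # Candidate quality metrics mapped to built-in evaluators from the
    # azure-ai-evaluation package. AI-assisted evaluators need a "judge"
    # model; the model_config values below are placeholders, not real ones.
    from azure.ai.evaluation import (
        CoherenceEvaluator,
        FluencyEvaluator,
        RelevanceEvaluator,
    )

    model_config = {
        "azure_endpoint": "https://<your-resource>.openai.azure.com",
        "api_key": "<your-api-key>",
        "azure_deployment": "<your-gpt-deployment>",
    }

    # One entry per quality dimension to score agent output on.
    evaluators = {
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
        "fluency": FluencyEvaluator(model_config),
    }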

3. Dataset Creation

  • How Copilot can assist with generating and curating datasets for evaluation; a sample dataset sketch follows this list.
  • Tips on setting up sample data for realistic agent-assessment scenarios.
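
As a concrete illustration, the sketch below writes a tiny JSONL dataset in the query/context/response row shape commonly consumed by azure-ai-evaluation. The Pet Planner rows are invented for illustration; in the workshop, the actual dataset is generated with Copilot’s help.

    import json

    # Hypothetical Pet Planner test cases. Field names follow the JSONL
    # row shape used by azure-ai-evaluation (query/context/response);
    # the content itself is invented for illustration.
    rows = [
        {
            "query": "Plan a feeding schedule for a 6-month-old puppy.",
            "context": "Puppies under 12 months typically eat three meals a day.",
            "response": "Feed at 7am, 12pm, and 6pm, adjusting portions by weight.",
        },
        {
            "query": "When is my cat's next rabies booster due?",
            "context": "Last rabies shot: 2024-05-01. Boosters are annual.",
            "response": "The next rabies booster is due around 2025-05-01.",
        },
    ]

    # Write one JSON object per line, as evaluation tooling expects.
    with open("pet_planner_eval.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")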

4. Building the Evaluation Script

  • April shows how to assemble an automated evaluation script for agent output using Copilot in Agent mode together with features offered by AI Toolkit; a minimal script sketch follows this list.
  • Covers script customization and practical considerations for robust evaluation.
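
A minimal version of such a script might look like the following, assuming the azure-ai-evaluation package and the dataset file from the previous sketch; the metric choices, file names, and judge-model settings are placeholders, not the workshop’s exact script.

    from azure.ai.evaluation import (
        GroundednessEvaluator,
        RelevanceEvaluator,
        evaluate,
    )

    # Placeholder judge-model settings; substitute your own resource values.
    model_config = {
        "azure_endpoint": "https://<your-resource>.openai.azure.com",
        "api_key": "<your-api-key>",
        "azure_deployment": "<your-gpt-deployment>",
    }

    # Run every evaluator over each row of the JSONL dataset and persist
    # both per-row scores and aggregate metrics to a results file.
    result = evaluate(
        data="pet_planner_eval.jsonl",
        evaluators={
            "relevance": RelevanceEvaluator(model_config),
            "groundedness": GroundednessEvaluator(model_config),
        },
        output_path="eval_results.json",
    )
    print(result["metrics"])  # aggregate scores, e.g. mean relevance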

5. Reviewing and Reporting Results

  • Steps for reviewing evaluation results, troubleshooting issues, and analyzing key findings.
  • Generating comprehensive evaluation reports with actionable recommendations using Copilot; a report-generation sketch follows this list.
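
As a sketch of the reporting step, the snippet below turns saved evaluation output into a short Markdown report. It assumes the eval_results.json layout produced by the earlier script (a top-level "metrics" dict of aggregate scores); in the workshop, Copilot drafts a richer report with recommendations.

    import json

    # Load the saved evaluation output and render the aggregate metrics
    # as a small Markdown report. Assumes a top-level "metrics" dict.
    with open("eval_results.json", encoding="utf-8") as f:
        results = json.load(f)

    lines = ["# Pet Planner Agent Evaluation Report", "", "## Aggregate metrics", ""]
    for name, score in sorted(results.get("metrics", {}).items()):
        value = f"{score:.2f}" if isinstance(score, float) else str(score)
        lines.append(f"- {name}: {value}")

    with open("eval_report.md", "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")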

Technologies & Tools Highlighted

  • GitHub Copilot (Agent Mode): Used for scripting and streamlining the workflow.
  • Microsoft AI Toolkit: Assists in model evaluation setup and management.
  • Microsoft Foundry: Platform context for project management and evaluation.
  • Azure: Underlying cloud infrastructure enabling these workflows.

Summary

This session provides actionable, step-by-step instruction on evaluating the performance of AI agents using GitHub Copilot, AI Toolkit, and Azure-powered infrastructure. The workshop is designed for developers who want to build, test, and iterate on real-world AI applications.