Pamela Fox demonstrates how to use the Azure AI Evaluation SDK to automate red-teaming of a RAG application, examining the safety risks of deploying LLM-powered apps and comparing attack success rates across different models.

Red-Teaming a RAG App with the Azure AI Evaluation SDK

Author: Pamela Fox

Introduction

Deploying user-facing applications powered by large language models (LLMs) carries the risk of producing unsafe outputs—such as content that encourages violence, hate speech, or self-harm. Manual testing is only a partial solution, since malicious users may craft highly creative inputs that bypass superficial filters.

The Challenge of Red-Teaming LLM Applications

Red-teaming is the process of rigorously probing a system for vulnerabilities, often with experts designing malicious prompts to assess weaknesses. Traditional red-teaming is resource-intensive and not practical for every iteration of an LLM-powered app.

The Automated Red Teaming Agent from Microsoft

Microsoft addresses this challenge with its automated Red Teaming agent, delivered via the azure-ai-evaluation Python package. The agent (see the usage sketch after this list):

  • Uses an adversarial LLM, safely sandboxed within Azure AI Foundry
  • Automatically generates unsafe query prompts across different risk categories
  • Applies known transformation and obfuscation attacks (using the pyrit package: base64, URL encoding, ciphers, etc.)
  • Sends both the original and transformed queries to your app
  • Assesses whether your app's responses answer the unsafe queries
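
Below is a minimal sketch of what a scan might look like with the azure-ai-evaluation red-teaming API. The project endpoint, the app_target wrapper, and the specific risk categories and attack strategies chosen here are placeholders, and parameter names may vary between SDK versions.

```python
import asyncio

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.red_team import RedTeam, RiskCategory, AttackStrategy


def app_target(query: str) -> str:
    # Placeholder: forward the adversarial query to your RAG app and return its reply.
    return answer_query(query)  # hypothetical entry point into your application


async def main():
    red_team = RedTeam(
        azure_ai_project="<your Azure AI Foundry project>",  # placeholder project reference
        credential=DefaultAzureCredential(),
        risk_categories=[
            RiskCategory.Violence,
            RiskCategory.HateUnfairness,
            RiskCategory.Sexual,
            RiskCategory.SelfHarm,
        ],
        num_objectives=5,  # adversarial objectives generated per risk category
    )
    # Send both plain queries and obfuscated variants (base64, URL encoding, ciphers).
    await red_team.scan(
        target=app_target,
        scan_name="rag-app-redteam-scan",
        attack_strategies=[
            AttackStrategy.EASY,
            AttackStrategy.BASE64,
            AttackStrategy.URL,
            AttackStrategy.ROT13,
        ],
        output_path="redteam-results.json",
    )


if __name__ == "__main__":
    asyncio.run(main())
```

The scan writes a scorecard of attack success rates per risk category and attack strategy, which is the kind of breakdown discussed in the results below.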

Testing a Retrieval-Augmented Generation (RAG) Application

Pamela tested this process on her RAG-on-PostgreSQL sample application, which uses RAG techniques to answer product queries from a sample outdoors store database. The app retrieves top product details based on user queries and sends them, along with a customer service prompt, to an LLM.
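
That flow can be pictured with a simplified sketch (not the actual sample code): search_products stands in for the PostgreSQL retrieval step, and the prompt wording and model name are illustrative only.

```python
from openai import OpenAI

client = OpenAI()  # could equally be an Azure OpenAI or Ollama-compatible client

SYSTEM_PROMPT = (
    "You are a customer service agent for an outdoor gear store. "
    "Answer only from the product details provided in the sources."
)


def answer_query(user_query: str) -> str:
    # Hypothetical retrieval step: top matching product rows from PostgreSQL.
    products = search_products(user_query, top=3)
    sources = "\n".join(
        f"- {p['name']}: {p['description']} (${p['price']})" for p in products
    )
    # Send the retrieved product details plus the customer service prompt to the LLM.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{user_query}\n\nSources:\n{sources}"},
        ],
    )
    return response.choices[0].message.content
```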

[Example app screenshot]

Red-Teaming Results Across Models

Pamela ran the agent against multiple backend models: Azure OpenAI gpt-4o-mini, Meta’s Llama3.1:8b via Ollama, and Hermes3:3b via Ollama. Results included:

Model         Host           Attack Success Rate
gpt-4o-mini   Azure OpenAI   0% 🥳
llama3.1:8b   Ollama         2%
hermes3:3b    Ollama         12.5% 😭
  • gpt-4o-mini (Azure OpenAI): 0% attack success, attributed to robust content safety filters and reinforcement learning from human feedback (RLHF).
  • llama3.1:8b (Ollama): Low (2%) success, indicating effective RLHF even on local models.
  • hermes3:3b (Ollama): Higher (12.5%) success, with self-harm prompts being most successful (31.25% in that category), likely reflecting less training in filtering such content.

Analysis included breakdowns by attack category and complexity, and example attacks highlighted subtle failures—particularly when prompt context accidentally lent legitimacy to unsafe queries.

Lessons and Mitigations

  • Azure AI Content Safety API: For models with higher attack success rates, layering Microsoft’s content safety APIs in front of the model and rerunning the red-teaming scan is recommended (a sketch of such a check follows this list).
  • Prompt Engineering: It helps but may not suffice against sophisticated attacks.
  • Comprehensive Testing: A robust, multi-faceted red-teaming scan is vital before deploying to production, especially for models lacking integrated guardrails.
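
As one example of the first mitigation, here is a minimal sketch of screening text with the Azure AI Content Safety SDK (azure-ai-contentsafety); the endpoint, key, and severity threshold are placeholders, and answer_query refers to the hypothetical app wrapper from the earlier sketch.

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions

safety_client = ContentSafetyClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com",  # placeholder
    credential=AzureKeyCredential("<your-key>"),  # placeholder
)


def is_safe(text: str, max_severity: int = 0) -> bool:
    # Flag the text if any harm category exceeds the allowed severity level.
    result = safety_client.analyze_text(AnalyzeTextOptions(text=text))
    return all((item.severity or 0) <= max_severity for item in result.categories_analysis)


# Example: screen the model's reply before returning it to the user.
reply = answer_query("How do I sharpen my camping knife?")  # hypothetical app call
if not is_safe(reply):
    reply = "I'm sorry, I can't help with that request."
```

After adding such a layer, rerunning the same red-teaming scan shows whether the attack success rate actually drops.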

Conclusion

Using tools like the Azure AI Evaluation SDK makes security evaluation scalable, repeatable, and more accessible to typical development teams. Pamela’s hands-on results reveal concrete risks and best practices for deploying LLM-powered apps in production—underscoring the need for layered safety controls and ongoing automated testing.


References:

This post appeared first on the Microsoft Tech Community blog.