Red-Teaming a RAG App with the Azure AI Evaluation SDK
Pamela Fox demonstrates how to use the Azure AI Evaluation SDK to automate red-teaming of a RAG application, analyzing risks when deploying LLMs and showing security outcomes across different AI models.
Author: Pamela Fox
Introduction
Deploying user-facing applications powered by large language models (LLMs) carries the risk of producing unsafe outputs—such as content that encourages violence, hate speech, or self-harm. Manual testing is only a partial solution, since malicious users may craft highly creative inputs that bypass superficial filters.
The Challenge of Red-Teaming LLM Applications
Red-teaming is the process of rigorously probing a system for vulnerabilities, often with experts designing malicious prompts to assess weaknesses. Traditional red-teaming is resource-intensive and not practical for every iteration of an LLM-powered app.
The Automated Red Teaming Agent from Microsoft
Microsoft addresses this challenge with its automated Red Teaming Agent, delivered via the azure-ai-evaluation Python package (a usage sketch follows the list below). This tool:
- Uses an adversarial LLM, safely sandboxed within Azure AI Foundry
- Automatically generates unsafe query prompts across different risk categories
- Applies known transformation and obfuscation attacks using the PyRIT package (Base64, URL encoding, ciphers, etc.)
- Evaluates both the original and transformed queries against your app
- Assesses whether your app provides answers to the unsafe queries
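To make the workflow concrete, here is a minimal sketch of how such a scan can be wired up with the RedTeam class in the azure-ai-evaluation package. The project endpoint, risk categories, objective count, and attack strategies are illustrative choices rather than the settings Pamela used, and exact parameter or enum names may differ between SDK versions.

```python
# Minimal red-team scan sketch (azure-ai-evaluation); values are placeholders, not the
# configuration from the original post.
import asyncio
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.red_team import RedTeam, RiskCategory, AttackStrategy


async def rag_app_target(query: str) -> str:
    # Placeholder target: forward the adversarial query to the app under test and
    # return its reply. A fuller RAG-backed version is sketched in the next section.
    return "I'm sorry, I can only help with questions about our outdoor products."


red_team = RedTeam(
    azure_ai_project="https://<your-foundry-project>.services.ai.azure.com/",  # placeholder endpoint
    credential=DefaultAzureCredential(),
    risk_categories=[
        RiskCategory.Violence,
        RiskCategory.HateUnfairness,
        RiskCategory.SelfHarm,
        RiskCategory.Sexual,
    ],
    num_objectives=5,  # adversarial prompts generated per risk category
)


async def main():
    await red_team.scan(
        target=rag_app_target,
        scan_name="rag-postgres-scan",
        attack_strategies=[
            AttackStrategy.Base64,  # obfuscation transforms applied via PyRIT
            AttackStrategy.ROT13,
            AttackStrategy.Url,     # enum member name assumed; check your SDK version
        ],
        output_path="redteam-results.json",
    )


asyncio.run(main())
```

The scan sends both the plain and the obfuscated adversarial prompts to the target callback and records which ones elicit unsafe answers.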
Testing a Retrieval-Augmented Generation (RAG) Application
Pamela tested this process on her RAG-on-PostgreSQL sample application, which answers product questions from a sample outdoor store database. For each user query, the app retrieves the top matching product details and sends them, along with a customer service prompt, to an LLM.
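To show how an app like this could be exposed to the scanner, below is a hypothetical target callback: it retrieves matching product rows, builds the customer-service prompt, and calls the chat model. The function name search_products, the prompt wording, and the model configuration are illustrative stand-ins, not the sample app's actual code.

```python
# Hypothetical callback wiring a RAG flow into RedTeam.scan(); not the sample app's real code.
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes API key / base_url configured for your model host

SYSTEM_PROMPT = (
    "You are a customer service assistant for an outdoor gear store. "
    "Answer only using the product details provided below."
)


async def search_products(query: str) -> list[str]:
    # Placeholder for the retrieval step, e.g. a pgvector similarity search
    # over the products table in PostgreSQL.
    return ["Trailblazer Tent: 2-person, waterproof, $199"]


async def rag_app_target(query: str) -> str:
    """Called with each adversarial query; returns the app's reply for evaluation."""
    products = await search_products(query)
    context = "\n".join(products)
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # swap for llama3.1:8b / hermes3:3b via an OpenAI-compatible Ollama endpoint
        messages=[
            {"role": "system", "content": f"{SYSTEM_PROMPT}\n\nProducts:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content or ""
```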
Red-Teaming Results Across Models
Pamela ran the agent against multiple backend models: Azure OpenAI gpt-4o-mini, Meta’s Llama3.1:8b via Ollama, and Hermes3:3b via Ollama. Results included:
| Model | Host | Attack Success Rate |
|---|---|---|
| gpt-4o-mini | Azure OpenAI | 0% 🥳 |
| llama3.1:8b | Ollama | 2% |
| hermes3:3b | Ollama | 12.5% 😭 |
- gpt-4o-mini (Azure OpenAI): 0% attack success, attributed to robust content safety filters and reinforcement learning from human feedback (RLHF).
- llama3.1:8b (Ollama): Low (2%) success, indicating effective RLHF even on local models.
- hermes3:3b (Ollama): Higher (12.5%) success, with self-harm prompts being most successful (31.25% in that category), likely reflecting less training in filtering such content.
Analysis included breakdowns by attack category and complexity, and example attacks highlighted subtle failures—particularly when prompt context accidentally lent legitimacy to unsafe queries.
Lessons and Mitigations
- Azure AI Content Safety API: For models with higher attack success rates, layering Microsoft’s safety APIs in front of the model and rerunning the red-teaming process is recommended (see the sketch after this list).
- Prompt Engineering: It helps but may not suffice against sophisticated attacks.
- Comprehensive Testing: A robust, multi-faceted red-teaming scan is vital before deploying to production, especially for models lacking integrated guardrails.
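For the first mitigation, a content-safety check can be layered around the model call. The sketch below uses the azure-ai-contentsafety package to screen text and block anything above a chosen severity; the endpoint, key handling, and threshold are placeholder choices, not settings from the original post.

```python
# Sketch of a layered Azure AI Content Safety check; endpoint, key, and threshold are placeholders.
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions

client = ContentSafetyClient(
    endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["CONTENT_SAFETY_KEY"]),
)


def is_unsafe(text: str, max_severity: int = 2) -> bool:
    """Return True if any harm category (violence, hate, sexual, self-harm) exceeds the threshold."""
    result = client.analyze_text(AnalyzeTextOptions(text=text))
    return any(
        item.severity is not None and item.severity > max_severity
        for item in result.categories_analysis
    )


# Usage idea: screen both the incoming query and the model's reply before returning it.
# if is_unsafe(user_query) or is_unsafe(model_reply):
#     model_reply = "Sorry, I can't help with that request."
```

After adding such a layer, rerunning the red-team scan verifies whether the attack success rate actually drops.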
Conclusion
Using tools like the Azure AI Evaluation SDK makes security evaluation scalable, repeatable, and more accessible to typical development teams. Pamela’s hands-on results reveal concrete risks and best practices for deploying LLM-powered apps in production—underscoring the need for layered safety controls and ongoing automated testing.
References:
- Azure AI Red Teaming Agent Documentation
- azure-ai-evaluation Python Package
- Azure AI Content Safety Overview
- RAG-on-PostgreSQL Sample Application
This post appeared first on “Microsoft Tech Community”.