Pamela Fox investigates how the Azure AI Evaluation SDK enables automated red-teaming for LLM-powered RAG apps, examining real attack scenarios and showing practical ways to evaluate and strengthen AI security.

Red-teaming a RAG Application with Azure AI Evaluation SDK

By Pamela Fox

Deploying user-facing applications powered by large language models (LLMs) introduces safety risks, such as inadvertently producing harmful outputs or responding to adversarial prompts. This article reviews how to automate the red-teaming (adversarial testing) process for Retrieval-Augmented Generation (RAG) apps using Microsoft’s Azure AI Evaluation SDK.

Why Red-teaming?

Manual red-teaming brings in experts to probe apps with malicious queries, but it’s expensive and impractical to repeat for every model, prompt, or app iteration. Automated tools can help spot vulnerabilities earlier and more efficiently.

Azure AI Evaluation SDK and Automated Red Teaming

Microsoft’s automated AI Red Teaming agent (part of the azure-ai-evaluation Python package) spins up adversarial LLM agents inside an Azure AI Foundry project. These agents generate a battery of unsafe questions and apply transformations (Base64 encoding, Caesar cipher, URL encoding, etc.) via PyRIT to craft more complex attacks. Your app’s responses are then systematically evaluated for unsafe outputs.
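
Below is a minimal sketch of wiring up such a scan with the SDK’s red-teaming preview. The RedTeam, RiskCategory, and AttackStrategy names follow the azure-ai-evaluation package, but exact parameter names may differ between preview releases; the project endpoint, scan name, and objective count are placeholders.

    import asyncio
    from azure.identity import DefaultAzureCredential
    from azure.ai.evaluation.red_team import RedTeam, RiskCategory, AttackStrategy

    # Configure the red-teaming agent against an Azure AI Foundry project (placeholder endpoint).
    red_team = RedTeam(
        azure_ai_project="https://<your-account>.services.ai.azure.com/api/projects/<your-project>",
        credential=DefaultAzureCredential(),
        risk_categories=[RiskCategory.HateUnfairness, RiskCategory.SelfHarm,
                         RiskCategory.Sexual, RiskCategory.Violence],
        num_objectives=5,  # attack objectives generated per risk category
    )

    async def main():
        # rag_app_target is the callback that forwards each attack prompt to the app
        # (a sketch of it appears in the next section).
        await red_team.scan(
            target=rag_app_target,
            scan_name="rag-on-postgresql-scan",
            attack_strategies=[AttackStrategy.EASY, AttackStrategy.MODERATE],
            output_path="redteam-results.json",
        )

    asyncio.run(main())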

RAG-on-PostgreSQL: A Practical Experiment

The author’s sample RAG application answers user queries by fetching product information from a fictional retail database and inserting the search results into the LLM prompt. The default model is Azure OpenAI’s gpt-4o-mini, but the setup is flexible: models from Azure OpenAI, GitHub, or Ollama can be swapped in.
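
As an illustration, the target passed to the scan could be a small async callback that forwards each adversarial prompt to the RAG app’s chat endpoint. The URL, request body, and response shape below are assumptions for the sketch, not the sample app’s actual API.

    import httpx

    async def rag_app_target(query: str) -> str:
        # Forward the adversarial prompt to the locally running RAG app
        # (assumed endpoint and payload shape; adjust to your deployment).
        async with httpx.AsyncClient(timeout=60) as client:
            response = await client.post(
                "http://localhost:8000/chat",
                json={"messages": [{"role": "user", "content": query}]},
            )
        response.raise_for_status()
        # Return just the assistant's answer text for evaluation.
        return response.json()["message"]["content"]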

[Screenshot: RAG app UI]

Testing Results: Azure vs Open Models

Red-teaming the RAG app highlighted distinct differences in model robustness:

Model         Host           Attack Success Rate
gpt-4o-mini   Azure OpenAI   0%
llama3.1:8b   Ollama         2%
hermes3:3b    Ollama         12.5%

gpt-4o-mini (Azure OpenAI):

  • Benefits from Azure OpenAI’s content safety filters.
  • Further protected by rigorous RLHF (reinforcement learning from human feedback) processes.

llama3.1:8b and hermes3:3b (Ollama):

  • Despite its open-weight status, llama3.1:8b also scored well, likely thanks to Meta’s documented RLHF process.
  • Hermes was selected for its more ‘neutrally-aligned’ approach and proved most vulnerable, especially to ‘self-harm’ category attacks.

Attack Strategies and Categories

Red-teaming attacks were broken down along two dimensions:

  • Attack category: hate/unfairness, self-harm, sexual, violence
  • Complexity: easy (string transforms), moderate (hypothetical/tense rephrasing by an LLM), difficult (compositions of multiple strategies)

Example Attacks

  • The tense strategy prompts the LLM to answer as though it were explaining in a different timeframe or scenario (e.g., building an explosive device in a hypothetical society).
  • Combined strategies such as tense + URL encoding were also tested.
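
A sketch of how those strategy tiers and combinations might be requested follows. AttackStrategy.Compose and the individual Tense and Url members are assumed to match the preview enum, so verify the exact names against the package documentation.

    from azure.ai.evaluation.red_team import AttackStrategy

    attack_strategies = [
        AttackStrategy.EASY,      # easy tier: string transforms such as Base64 and Caesar cipher
        AttackStrategy.MODERATE,  # moderate tier: LLM-driven rephrasing such as tense changes
        # difficult tier: compose two strategies, e.g. tense rephrasing plus URL encoding
        AttackStrategy.Compose([AttackStrategy.Tense, AttackStrategy.Url]),
    ]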

Lessons Learned and Next Steps

  • Azure OpenAI’s content safety and RLHF provide strong practical defense.
  • More open models demand additional manual guardrails (e.g., the Azure AI Content Safety API) and repeated automated testing (see the sketch after this list).
  • Prompt engineering alone is insufficient to block sophisticated or compound attacks.
  • Before production, run comprehensive automated red teaming using diverse and complex strategies.
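
For the extra guardrails mentioned above, here is a minimal sketch of screening a model response with the Azure AI Content Safety text-analysis client; the endpoint, key, and severity threshold are placeholders chosen for illustration.

    from azure.ai.contentsafety import ContentSafetyClient
    from azure.ai.contentsafety.models import AnalyzeTextOptions
    from azure.core.credentials import AzureKeyCredential

    # Placeholder endpoint and key for an Azure AI Content Safety resource.
    client = ContentSafetyClient(
        endpoint="https://<your-content-safety-resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<your-key>"),
    )

    def is_safe(text: str, max_severity: int = 2) -> bool:
        # Reject the response if any harm category exceeds the chosen severity threshold.
        result = client.analyze_text(AnalyzeTextOptions(text=text))
        return all(item.severity <= max_severity for item in result.categories_analysis)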

References & Resources


Pamela Fox, Microsoft

This post appeared first on “Microsoft Tech Community”. Read the entire article here