Benchmarking Llama 2 70B and 405B Models on Azure ND GB200 v6 with MLPerf Inference v5.1
Mark Gitau explains how to benchmark Llama 2 70B and 405B models using MLPerf Inference v5.1 on Azure ND GB200 v6 VMs, detailing setup, performance highlights, and step-by-step replication guidance.
Author: Mark Gitau (Software Engineer)
Introduction
This guide covers Azure’s MLPerf Inference v5.1 benchmarking results for the latest ND GB200 v6 virtual machines. Powered by two NVIDIA Grace CPUs and four Blackwell B200 GPUs, these VMs deliver high-performance environments for testing and deploying large language models like Llama 2 70B and Llama 3.1 405B.
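Before running anything, it can be worth sanity-checking the hardware on a freshly provisioned VM. The commands below are a minimal sketch assuming the stock NVIDIA driver tooling on the Azure image; they are not part of the official benchmark flow:
nvidia-smi                     # should list the four Blackwell GPUs
nvidia-smi topo -m             # shows the GPU/NVLink topology
lscpu | grep -i "model name"   # should report the Arm-based NVIDIA Grace cores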
Highlights
- Llama 2 70B: Azure achieved 52,000 tokens/s offline on a single ND GB200 v6 VM (an 8% increase over previous records), scaling to approximately 937,098 tokens/s across a full NVL72 rack.
- Llama 3.1 405B: Azure matched the top global submissions at 847 tokens/s, demonstrating parity with leading cloud and on-premises systems.
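For context on the rack-scale figure: an NVL72 rack holds 72 Blackwell GPUs, which corresponds to 18 ND GB200 v6 VMs at 4 GPUs each, so the rack number is consistent with the single-VM result. A quick back-of-the-envelope check (simple arithmetic, not an official MLPerf figure):
# 72 GPUs per NVL72 rack / 4 GPUs per VM = 18 VMs
echo $(( 937098 / 18 ))   # 52061 tokens/s per VM, in line with the ~52,000 tokens/s single-VM result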
Step-by-Step Benchmark Replication on Azure
Prerequisites
- An Azure account and access to ND GB200 v6-series (single node) VMs.
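Before deploying, you may want to confirm that the ND GB200 v6 size is offered (and that you have quota) in your target region. A hedged Azure CLI sketch is below; take the exact size string from the list-skus output or the Azure portal rather than from this example:
az login
az vm list-skus --location <your-region> --size Standard_ND --output table   # look for the GB200 v6 entry and any restrictions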
Environment Setup
- Deploy VM: Launch a new ND GB200 v6 VM from the Azure portal.
- Directory Preparation: Create a working directory on an NVMe mount (see the mount sketch after this list).
export MLPERF_SCRATCH_PATH=/mnt/nvme/mlperf
mkdir -p $MLPERF_SCRATCH_PATH/data $MLPERF_SCRATCH_PATH/models $MLPERF_SCRATCH_PATH/preprocessed_data
- Clone MLPerf Repo:
git clone https://github.com/mlcommons/submissions_inference_v5.1 $MLPERF_SCRATCH_PATH
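The directory-preparation step above assumes a local NVMe drive formatted and mounted at /mnt/nvme. If your VM does not come that way, a minimal sketch follows; the device name /dev/nvme0n1 is an assumption, so confirm it with lsblk before formatting anything:
lsblk                           # identify the local NVMe device (assumed to be /dev/nvme0n1 below)
sudo mkfs.ext4 /dev/nvme0n1     # WARNING: destroys any existing data on the device
sudo mkdir -p /mnt/nvme
sudo mount /dev/nvme0n1 /mnt/nvme
sudo chown $USER /mnt/nvme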
Downloading Models and Datasets
- Models: Download the Llama 2 70B and Llama 3.1 405B checkpoints into $MLPERF_SCRATCH_PATH/models (one option is sketched below this list).
- Datasets: Download and preprocess the benchmark datasets into $MLPERF_SCRATCH_PATH/data and $MLPERF_SCRATCH_PATH/preprocessed_data using the scripts provided in the repository.
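The authoritative download and preprocessing scripts live in the cloned submission repository, so treat the commands below as an illustrative sketch only. One common route for the model weights is Hugging Face (which requires accepting Meta's license for the gated Llama checkpoints); the repository names and target paths here are assumptions:
pip install -U "huggingface_hub[cli]"
huggingface-cli login   # use a token that has access to the gated Llama repositories
huggingface-cli download meta-llama/Llama-2-70b-chat-hf --local-dir $MLPERF_SCRATCH_PATH/models/Llama-2-70b-chat-hf
huggingface-cli download meta-llama/Llama-3.1-405B-Instruct --local-dir $MLPERF_SCRATCH_PATH/models/Llama-3.1-405B-Instruct
The datasets and their preprocessing into $MLPERF_SCRATCH_PATH/preprocessed_data are handled by scripts inside the repository; follow its README for the benchmark-specific steps.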
Build and Run Benchmarks
- Export variables:
export SUBMITTER=Azure SYSTEM_NAME=ND_GB200_v6
- Build MLPerf Container: Navigate to the closed/Azure directory and execute:
make prebuild
make build
- Build Engines and Run Benchmarks:
- Build engines for Llama models:
make generate_engines RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server"
- Execute benchmarks:
make run_harness RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server"
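Once the harness finishes, each run writes its LoadGen output under the repository's build/logs directory (exact paths can vary between harness versions). A hedged sketch for locating the headline results:
find build/logs -name mlperf_log_summary.txt
grep -H -i -E "result is|tokens per second" $(find build/logs -name mlperf_log_summary.txt)   # run validity and token throughput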
About MLPerf and MLCommons
MLPerf is a set of industry benchmarks from MLCommons, an open engineering consortium, designed to deliver unbiased evaluations of both training and inference performance for AI hardware, software, and cloud services. These benchmarks simulate realistic compute-intensive AI workloads for informed technology assessment and selection.
For more details, see the official MLPerf documentation or Azure’s HPC blog.