Optimizing Large-Scale AI Performance with Pretraining Validation on a Single Azure ND GB200 v6
Authored by Mishty Dhekial and Hugo Affaticati, this analysis explores single-VM benchmarking of the Llama3 8B model on Azure ND GB200 v6 using the NVIDIA NeMo framework, offering concrete techniques and recommendations for optimizing large-scale AI training.
by Mishty Dhekial (Software Engineer Intern) and Hugo Affaticati (Cloud Infrastructure Engineer)
Why Llama?
This guide focuses on the Llama3 8B model due to its wide use in research and industry as an open-weight large language model (LLM). Llama’s transformer-based architecture, with multi-head self-attention, rotary positional embeddings, and grouped-query attention, makes it highly efficient for natural language processing. The model is featured in the Azure AI benchmarking guide for reproducible and transparent performance evaluation.
The 8B size is ideal for validation on Azure ND GB200 v6 VMs: it is demanding enough to exercise all four Blackwell GPUs while still fitting within the VM’s hardware limits. Larger models (such as Llama3 70B) exceed the VM’s memory capacity, while smaller ones underutilize the available resources.
Azure ND GB200 v6 VM Overview
Azure ND GB200 v6 VMs are powered by two NVIDIA Grace CPUs and four Blackwell GPUs. They are designed for demanding AI and ML workloads, delivering high throughput and reliability for model training.
Benchmarking Methodology
This study systematically explores parallelism parameters (tensor, pipeline, context, data), micro batch size, and mixed-precision settings for optimizing Llama3 pretraining; the resulting sweep space is sketched after the list:
- Parameters swept: micro batch size (MBS 1–16); tensor/pipeline/context parallelism (1–4 each)
- Precision: fp16 with fp8 mixed precision; bf16 with fp8 mixed precision is recommended for accurate pretraining
- Platform: Single ND GB200 v6 VM on Azure
- Framework: NVIDIA NeMo
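To make the sweep concrete, the sketch below enumerates candidate single-VM configurations and keeps only those whose parallelism degrees fit the GPU count (the product TP × PP × CP must divide the number of GPUs so that the remaining data-parallel degree is a whole number). The ranges mirror the sweep above, with power-of-two micro batch sizes shown for brevity; the helper itself is illustrative and not part of the published benchmarking scripts.

```python
from itertools import product

NUM_GPUS = 4  # one Azure ND GB200 v6 VM exposes four Blackwell GPUs

def valid_configs(num_gpus=NUM_GPUS):
    """Yield (mbs, tp, pp, cp, dp) tuples that fit on a single VM."""
    # Power-of-two MBS values within the 1-16 range swept; TP/PP/CP swept 1-4.
    for mbs, tp, pp, cp in product([1, 2, 4, 8, 16], range(1, 5), range(1, 5), range(1, 5)):
        model_parallel = tp * pp * cp
        # The model-parallel product must divide the GPU count so that the
        # leftover GPUs form an integer data-parallel group.
        if num_gpus % model_parallel == 0:
            yield mbs, tp, pp, cp, num_gpus // model_parallel

for mbs, tp, pp, cp, dp in valid_configs():
    print(f"MBS={mbs} TP={tp} PP={pp} CP={cp} -> DP={dp}")
```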
Parallelism Strategies
- Tensor Parallelism (TP): Splits model layers across GPUs for memory efficiency and scalability. Higher TP reduces per-GPU memory but increases communication overhead.
- Pipeline Parallelism (PP): Divides the model into sequential GPU stages; beneficial for extremely large models but may introduce synchronization bottlenecks in small-scale setups.
- Context Parallelism (CP): Partitions long input sequences over GPUs, reducing peak activation memory and supporting large-context models.
- Data Parallelism (DP): Replicates models independently on devices for batch-wise training; default configuration for scalability.
- Micro Batch Size (MBS): The number of samples processed per data-parallel rank in a single forward/backward pass; affects memory usage, step time, and GPU utilization. How MBS combines with the parallelism degrees into the global batch is sketched below.
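These degrees interact on a fixed GPU budget: whatever is not consumed by tensor, pipeline, and context parallelism becomes the data-parallel degree, and the micro batch size multiplies through it (together with any gradient accumulation) to give the effective global batch. Here is a minimal sketch of that arithmetic, where `grad_accum` is a generic gradient-accumulation factor rather than a setting named in this post:

```python
def parallel_layout(num_gpus, tp, pp, cp, mbs, grad_accum=1):
    """Return (data-parallel degree, global batch size) implied by a
    tensor/pipeline/context-parallel layout on num_gpus GPUs."""
    model_parallel = tp * pp * cp
    assert num_gpus % model_parallel == 0, "TP*PP*CP must divide the GPU count"
    dp = num_gpus // model_parallel
    return dp, mbs * dp * grad_accum

# The single-VM layout recommended later in this post: TP=1, PP=2, CP=1, MBS=4
print(parallel_layout(num_gpus=4, tp=1, pp=2, cp=1, mbs=4))  # -> (2, 8)
```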
Telemetry and Performance Metrics
Tracking these metrics is critical for understanding hardware and model efficiency (a minimal polling sketch follows the list):
- GPU Utilization
- Memory Usage
- Streaming Multiprocessor (SM) Clock Speed
- Training Step Time
- Training Step Loss
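Most of these signals can be polled outside the training process. The sketch below uses the NVIDIA Management Library bindings (`pip install nvidia-ml-py`) to sample GPU utilization, memory usage, and SM clock speed once per second; it is a generic monitoring loop, not the telemetry tooling from the benchmarking guide, and step time and loss still come from NeMo’s own training logs.

```python
import time
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

def sample():
    """Print utilization, memory use, and SM clock for every visible GPU."""
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu           # percent
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)                      # bytes
        sm = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)  # MHz
        print(f"GPU{i}: util={util}% mem={mem.used / 2**30:.1f} GiB sm_clock={sm} MHz")

for _ in range(60):   # sample once per second for a minute during training
    sample()
    time.sleep(1)

pynvml.nvmlShutdown()
```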
Larger batch sizes increase GPU utilization and memory demands, while increased parallelism may introduce communication overhead and longer step times. Minimal pipeline parallelism is optimal for single-VM scenarios; higher TP and CP can help when scaling out but may impair speed on a single VM.
Results and Recommendations
- Micro Batch Size: As MBS increases, step time decreases while GPU utilization and memory usage rise; the optimal value balances speed and resource use.
- Tensor Parallelism: Increasing TP reduces memory per GPU but slows training due to extra communication. TP=1 is fastest for this VM.
- Pipeline Parallelism: Higher PP adds synchronization overhead and idle time—best to keep minimal (PP=2 recommended for 4 GPUs).
- Context Parallelism: Higher CP decreases memory and clock speed but slows training; CP=1 is optimal for speed.
Optimal Configuration
- MBS = 4
- TP = 1
- PP = 2
- CP = 1
This setup achieved the highest memory and GPU utilization with fast pretraining steps for the Llama3 8B model on Azure ND GB200 v6; a configuration sketch applying these values follows.
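For orientation, here is a hedged sketch of what applying these values could look like with NeMo 2.x’s recipe API and the nemo_run launcher. The recipe and attribute names (`llm.llama3_8b.pretrain_recipe`, `trainer.strategy.*`, `data.micro_batch_size`) follow NeMo’s documented pattern but may differ between releases, so treat this as illustrative rather than the exact script from the benchmarking guide.

```python
import nemo_run as run
from nemo.collections import llm

# Llama3 8B pretraining recipe sized for one ND GB200 v6 VM (4 GPUs)
recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_nd_gb200_v6",
    num_nodes=1,
    num_gpus_per_node=4,
)

# Parallelism and batch settings found optimal above
recipe.trainer.strategy.tensor_model_parallel_size = 1
recipe.trainer.strategy.pipeline_model_parallel_size = 2
recipe.trainer.strategy.context_parallel_size = 1
recipe.data.micro_batch_size = 4
# Mixed precision (e.g., bf16 with fp8) is set through the recipe's trainer
# plugins; see the Azure AI benchmarking guide for the exact configuration.

# Launch locally across the VM's four GPUs via torchrun
executor = run.LocalExecutor(ntasks_per_node=4, launcher="torchrun")
run.run(recipe, executor=executor)
```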
For full reproducibility, refer to the automated Azure AI benchmarking guide. The guide includes scripts and further documentation for scaling up to multi-node clusters.
Conclusion
Small inefficiencies in VM configuration can cause significant performance losses at scale. Methodical benchmarking and telemetry tracking allow practitioners to optimize AI infrastructure upfront, saving time and cost on large fleets. By following these recommendations, you can reliably validate, troubleshoot, and scale pretraining workloads on Azure ND GB200 v6 VMs.
For more on model validation and infrastructure benchmarking, visit the Azure High Performance Computing (HPC) Blog.
Updated Aug 18, 2025
Author: Hugo Affaticati, Microsoft. Contributor: Mishty Dhekial.