AI Hypercomputer software updates: Faster training and inference, a new resource hub, and more

The potential of AI has never been greater, and infrastructure plays a foundational role in driving it forward. AI Hypercomputer is our supercomputing architecture based on performance-optimized hardware, open software, and flexible consumption models. Together, these offer exceptional performance and efficiency, resiliency at scale, and give you the flexibility to choose offerings at each layer to suit your needs.

Today, we’re announcing major updates to the AI Hypercomputer software layer for training and inference performance, improved resiliency at scale, as well as a centralized hub for AI Hypercomputer resources.

AI Hypercomputer resources on GitHub

AI Hypercomputer’s open software layer not only supports leading ML frameworks and orchestration options, but also provides workload optimizations and reference implementations to improve the time-to-value for your specific use case. To make the innovations in our open software stack easily accessible to developers and practitioners, we are introducing the AI Hypercomputer GitHub organization, a central place where you can discover reference implementations such as MaxText and MaxDiffusion, orchestration tools such as xpk (the Accelerated Processing Kit for cluster creation and workload management), and performance recipes for GPUs on Google Cloud. We’ll continue to add to this list and adapt these resources to a rapidly evolving landscape, and we invite you to contribute with us.

MaxText now supports A3 Mega VMs 

MaxText is a high-performance, highly scalable, open-source reference implementation for large language models (LLMs). You can now use performance-optimized LLM training examples for A3 Mega VMs, which are powered by NVIDIA H100 Tensor Core GPUs and offer a 2X improvement in GPU-to-GPU network bandwidth over A3 VMs. We worked closely with NVIDIA to optimize JAX and XLA, enabling the overlap of collective communication and computation on GPUs. Additionally, we added optimized model configurations and example scripts for GPUs with XLA flags enabled.
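The tuned XLA flags are baked into the published example scripts, so you normally don’t set them by hand. As a rough illustration only, XLA flags reach JAX through the XLA_FLAGS environment variable set before JAX initializes; the latency-hiding scheduler flag below is one commonly used option for overlapping collectives with compute, not necessarily the exact set used in the recipes.

```python
import os

# Illustrative only: the A3 Mega recipes ship their own tuned flag set.
# XLA_FLAGS must be set before JAX is imported so the flags take effect.
os.environ["XLA_FLAGS"] = "--xla_gpu_enable_latency_hiding_scheduler=true"

import jax  # imported after setting XLA_FLAGS

print(jax.devices())  # confirm the GPUs are visible with the flags applied
```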

MaxText with A3 Mega VMs can deliver near-linear scaling of training performance as you scale the number of VMs in the cluster, as demonstrated below by Llama2-70b pre-training.

[Figure 1a: Llama2-70b training performance on A3 Mega]

Figure-1a: Google internal data for Llama2-70b (MaxText) pretraining on A3 Mega. Relative performance (bf16 training) vs ideal scaling. A3 Mega training with bf16 also demonstrates close to linear scaling.

Furthermore, you can use FP8 mixed-precision training on A3 Mega VMs for additional acceleration and higher hardware utilization. We added FP8 support to MaxText via Accurate Quantized Training (AQT), the same quantization library that also powers INT8 mixed-precision training on Cloud TPUs.

Our benchmarks on dense models indicate that FP8 training with AQT can deliver up to a 55% improvement in effective model flops utilization (EMFU) compared to bf16. You can find recipes and optimized training examples for A3 Mega here.

[Figure 1b: Llama2-70b training performance on A3 Mega]

Figure-1b: Google internal data for Llama2-70b (MaxText) pretraining on A3 Mega. Effective model flops utilization (EMFU) is computed with base bf16 peak flops for both bf16 and fp8 mixed-precision training. Sequence length is 4096 tokens.
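For reference, EMFU here is ordinary MFU computed against the chip’s bf16 peak for both the bf16 and fp8 runs, so an fp8 speedup shows up directly as a higher EMFU number. A minimal sketch of that bookkeeping (all numbers below are placeholders, not measured results):

```python
# EMFU bookkeeping sketch: effective model FLOPs utilization relative to the
# bf16 peak of the hardware, used for both bf16 and fp8 runs.

def emfu(model_flops_per_step: float, step_time_s: float,
         bf16_peak_flops_per_s: float) -> float:
    """Effective MFU relative to the bf16 peak of the chip."""
    return model_flops_per_step / (step_time_s * bf16_peak_flops_per_s)

BF16_PEAK = 989e12        # approximate per-GPU dense bf16 peak FLOP/s (assumption)
FLOPS_PER_STEP = 5.0e15   # placeholder model FLOPs per training step

print(emfu(FLOPS_PER_STEP, step_time_s=9.0, bf16_peak_flops_per_s=BF16_PEAK))  # bf16 run
print(emfu(FLOPS_PER_STEP, step_time_s=6.0, bf16_peak_flops_per_s=BF16_PEAK))  # fp8 run
```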

Reference implementations and kernels for MoEs

For most mixture of experts (MoE) use cases, it’s useful to have consistent resource utilization of a limited number of experts. However, for certain applications, the ability to use more experts to develop richer responses is more important. To give you this flexibility, we’ve now expanded MaxText to include both “capped” and “no-cap” MoE implementations so you can choose the best implementation for your model architecture. Capped MoE models offer predictable performance, while no-cap models dynamically allocate resources for optimal efficiency.
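To make the distinction concrete, here is a minimal, self-contained JAX sketch (not the MaxText implementation) of top-1 routing with and without an expert capacity cap: with a cap, each expert keeps at most a fixed number of tokens and the overflow is dropped; without a cap, every token is processed by its chosen expert.

```python
import jax
import jax.numpy as jnp

def top1_route(router_logits):
    """Pick the highest-scoring expert for each token."""
    return jnp.argmax(router_logits, axis=-1)            # [num_tokens]

def capped_assignment(expert_ids, num_experts, capacity):
    """Capped MoE: each expert keeps at most `capacity` tokens; the rest are dropped."""
    one_hot = jax.nn.one_hot(expert_ids, num_experts, dtype=jnp.int32)
    # 1-indexed position of each token within its chosen expert's queue.
    position_in_expert = jnp.cumsum(one_hot, axis=0) * one_hot
    keep = (position_in_expert <= capacity).astype(jnp.int32) * one_hot
    return keep                                           # [num_tokens, num_experts]

def nocap_assignment(expert_ids, num_experts):
    """No-cap MoE: every token is dispatched to its chosen expert."""
    return jax.nn.one_hot(expert_ids, num_experts, dtype=jnp.int32)

logits = jax.random.normal(jax.random.PRNGKey(0), (16, 8))  # 16 tokens, 8 experts
ids = top1_route(logits)
print(capped_assignment(ids, num_experts=8, capacity=2).sum())  # <= 16: overflow tokens dropped
print(nocap_assignment(ids, num_experts=8).sum())               # 16: all tokens kept
```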

To further accelerate MoE training, we’ve open-sourced Pallas kernels, which are optimized for block-sparse matrix multiplication on Cloud TPUs (Pallas is an extension to JAX to allow fine-grained control over code generated for XLA devices such as GPUs and TPUs; Block-sparse matmul is currently available only for TPUs). These kernels can be used with both PyTorch and JAX, providing high-performance building blocks for training your MoE models.
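The open-sourced kernels themselves handle block sparsity, but the Pallas programming model is easiest to see on a dense example. Below is a minimal blocked matmul written with pl.pallas_call, purely as an illustration of how Pallas tiles a computation and not a reproduction of the released kernels; the BlockSpec argument order follows recent JAX releases and may differ in older versions.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def matmul_kernel(x_ref, y_ref, o_ref):
    # Each grid step computes one (bm, bn) output tile from a (bm, K) row
    # block of x and a (K, bn) column block of y.
    o_ref[...] = jnp.dot(x_ref[...], y_ref[...])

def blocked_matmul(x, y, bm=128, bn=128):
    m, k = x.shape
    _, n = y.shape
    return pl.pallas_call(
        matmul_kernel,
        grid=(m // bm, n // bn),
        in_specs=[
            pl.BlockSpec((bm, k), lambda i, j: (i, 0)),   # row block of x
            pl.BlockSpec((k, bn), lambda i, j: (0, j)),   # column block of y
        ],
        out_specs=pl.BlockSpec((bm, bn), lambda i, j: (i, j)),
        out_shape=jax.ShapeDtypeStruct((m, n), x.dtype),
    )(x, y)

x = jnp.ones((256, 512), jnp.float32)
y = jnp.ones((512, 256), jnp.float32)
print(blocked_matmul(x, y).shape)  # (256, 256)
```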

[Figure 2: Weak scaling with Mixtral 8x7b]

Figure-2: Google internal data for Mixtral-8x7b (MaxText) pretraining on Cloud TPU v5p. Sequence length is 4096 tokens. Weak scaling is measured with fixed per-device batch size.

Our benchmarks with the no-cap MoE model (Mixtral-8x7b) indicate near-linear scaling with a fixed per-device batch size (Figure-2). We also observed close-to-linear scaling when we increased the number of experts in the base configuration along with the number of accelerators (Figure-3), which is indicative of performance on models with higher sparsity as well.

[Figure 3: Weak scaling with Mixtral-Nx7b variants]

Figure-3: Google internal data for Mixtral-Nx7b (MaxText) pretraining on Cloud TPU v5p. Sequence length is 4096 tokens. Weak scaling is measured by increasing the number of experts (N) along with the size of the cluster, which ranges from 64 to 512 v5p chips.

Monitoring large-scale training

Having large clusters of accelerators that are expected to work together as a unit on a training task can complicate MLOps. You may have questions such as “Did host transfer latencies spike for a reason?” or “Why did this one device hit a segfault?” Yet monitoring large-scale training jobs with the right metrics is imperative to maximizing your resource utilization and improving overall ML Goodput. To simplify this crucial part of your MLOps charter, we’ve introduced a reference monitoring recipe. This recipe helps you create a Cloud Monitoring dashboard within your Google Cloud project that shows useful statistical measures such as average or maximum CPU utilization and helps you identify outliers in the setup so you can take corrective action.
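The recipe provisions the dashboard for you; if you also want to pull the underlying metrics programmatically, the sketch below (not part of the recipe) uses the Cloud Monitoring Python client to fetch the last hour of per-VM CPU utilization so you can scan for outlier hosts. The project ID is a placeholder.

```python
import time
from google.cloud import monitoring_v3

# Minimal sketch: list the last hour of per-VM CPU utilization time series.
client = monitoring_v3.MetricServiceClient()
name = "projects/your-project-id"  # placeholder project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

series = client.list_time_series(
    request={
        "name": name,
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Print the peak utilization per VM to spot outliers at a glance.
for ts in series:
    values = [p.value.double_value for p in ts.points]
    print(ts.resource.labels["instance_id"], f"max={max(values):.2%}")
```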

SparseCore on Cloud TPU v5p is now GA

Recommender models, as well as other models that rely on embeddings, require high-performance random memory access to look up those embeddings. SparseCore, the TPU’s hardware accelerator for embeddings, enables you to build more powerful and efficient recommendation systems. Each Cloud TPU v5p chip has four dedicated SparseCores, delivering up to 2.5X better DLRM-V2 performance than its predecessor.

[Figure: Relative training throughput, DLRM-V2]
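For context, the sketch below shows the access pattern at the heart of a DLRM-style model in plain JAX: embedding lookups are gathers of rows scattered across large tables, i.e., high-volume random memory access. It is purely conceptual; the table size and batch are arbitrary, and it does not use any SparseCore-specific API.

```python
import jax
import jax.numpy as jnp

VOCAB, DIM, BATCH = 100_000, 128, 4096  # arbitrary sizes for illustration

table = jax.random.normal(jax.random.PRNGKey(0), (VOCAB, DIM), dtype=jnp.bfloat16)
ids = jax.random.randint(jax.random.PRNGKey(1), (BATCH,), 0, VOCAB)

@jax.jit
def lookup(table, ids):
    # Gather of rows scattered across the table: random memory access,
    # the pattern that dedicated embedding hardware is designed to accelerate.
    return jnp.take(table, ids, axis=0)   # [BATCH, DIM]

print(lookup(table, ids).shape)
```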

Improving LLM inference performance

Finally, to improve LLM inference performance, we introduced KV cache quantization and ragged attention kernels in JetStream, an open-source throughput-and-memory-optimized engine for LLM inference. Together, these enhancements improve inference performance by up to 2X on Cloud TPU v5e.

[Figure: JetStream throughput on Cloud TPU v5e-8]

JetStream throughput (output tokens per second). Google internal data. Baseline: as published in “Accelerate AI Inference with Google Cloud TPUs and GPUs.” Current: measured using Gemma 7B (MaxText), Llama 2 7B (PyTorch/XLA), and Llama 2 70B on Cloud TPU v5e-8. Maximum input length: 1024; maximum output length: 1024.
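As a rough illustration of the first of these techniques, the sketch below quantizes a KV cache to int8 with a per-head scale, roughly halving cache memory relative to bf16; it is a conceptual example, not JetStream’s implementation.

```python
import jax.numpy as jnp

def quantize_kv(kv: jnp.ndarray):
    # kv: [batch, heads, seq, head_dim]. Store int8 values plus a per-head scale.
    scale = jnp.max(jnp.abs(kv), axis=(-2, -1), keepdims=True) / 127.0
    q = jnp.clip(jnp.round(kv / scale), -127, 127).astype(jnp.int8)
    return q, scale

def dequantize_kv(q: jnp.ndarray, scale: jnp.ndarray):
    # Recover an approximate bf16 cache on the fly when attending.
    return q.astype(jnp.bfloat16) * scale.astype(jnp.bfloat16)

kv = jnp.ones((1, 8, 1024, 128), jnp.bfloat16)
q, scale = quantize_kv(kv)
print(q.dtype, q.nbytes, "vs", kv.nbytes)  # int8 cache uses ~half the bytes of bf16
```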

Empowering your AI journey

From pushing the boundaries of model training and inference, to enhancing accessibility through a central resource repository, each component of the AI Hypercomputer is a building block for the next generation of AI. We envision a future where AI practitioners can seamlessly scale from concept to production, unburdened by infrastructure limitations. 

Explore the latest AI Hypercomputer resources, including the optimized recipes, xpk (the Accelerated Processing Kit), reference implementations, and more.