Accelerate your gen AI: Deploy Llama4 & DeepSeek on AI Hypercomputer with new recipes

The pace of innovation in open-source AI is breathtaking, with models like Meta’s Llama4 and DeepSeek AI’s DeepSeek family. However, deploying and optimizing large, powerful models can be complex and resource-intensive. Developers and machine learning (ML) engineers need reproducible, verified recipes that spell out the steps for trying these models on available accelerators.

Today, we’re excited to announce enhanced support and new, optimized recipes for the latest Llama4 and DeepSeek models, leveraging our cutting-edge AI Hypercomputer platform. AI Hypercomputer helps build a strong AI infrastructure foundation using a set of purpose-built infrastructure components that are designed to work well together for AI workloads like training and inference. It is a systems-level approach that draws from our years of experience serving AI experiences to billions of users, and combines purpose-built hardware, optimized software and frameworks, and flexible consumption models. Our AI Hypercomputer resources repository on GitHub, your hub for these recipes, continues to grow.

In this blog, we’ll show you how to access Llama4 and DeepSeek models today on AI Hypercomputer. 
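The recipes referenced throughout this post live in the AI Hypercomputer organization on GitHub. As a minimal sketch, assuming the TPU and GPU recipe repositories keep their current names (the links in the tables at the end of this post are the authoritative locations), you can pull them locally like this:

# Hedged sketch: the repository names below are assumptions; follow the recipe
# links in the tables at the end of this post for the authoritative locations.
git clone https://github.com/AI-Hypercomputer/tpu-recipes.git
git clone https://github.com/AI-Hypercomputer/gpu-recipes.git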

Added support for new Llama4 models 

Meta recently released the Scout and Maverick models in the Llama4 herd of models. Llama 4 Scout is a 17-billion-active-parameter model with 16 experts, and Llama 4 Maverick is a 17-billion-active-parameter model with 128 experts. These models deliver innovations and optimizations based on a Mixture of Experts (MoE) architecture, and they support multimodal input and long context lengths.

But serving these models can present deployment and resource-management challenges. To help simplify this process, we’re releasing new recipes for serving Llama4 models on Google Cloud Trillium TPUs and on A3 Mega and A3 Ultra GPUs.

  • JetStream, Google’s throughput- and memory-optimized engine for LLM inference on XLA devices, now supports Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E inference on Trillium, the sixth-generation TPU. New recipes provide the steps to deploy these models using JetStream and MaxText on a Trillium TPU GKE cluster.

  • vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. New recipes demonstrate how to use vLLM to serve the Llama4 Scout and Maverick models on A3 Mega and A3 Ultra GPU GKE clusters; a minimal vLLM invocation is sketched just after this list.

  • For serving the Maverick model on TPUs, we utilize Pathways on Google Cloud. Pathways is a system which simplifies large-scale machine learning computations by enabling a single JAX client to orchestrate workloads across multiple large TPU slices. In the context of inference, Pathways enables multi-host serving across multiple TPU slices. Pathways is used internally at Google to train and serve large models like Gemini.

  • MaxText provides high-performance, highly scalable, open-source LLM reference implementations for OSS models, written in pure Python/JAX and targeting Google Cloud TPUs and GPUs for training and inference. MaxText now includes reference implementations for the Llama4 Scout and Maverick models, along with information on how to perform checkpoint conversion, training, and decoding for them.
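To make the GPU serving path concrete, here is a minimal, hedged sketch of launching Llama 4 Scout with vLLM on a single 8-GPU A3 node; the model ID and flag values are illustrative assumptions, and the published recipes pin the exact container image, flags, and GKE manifests:

# Hedged sketch, not the recipe's verified command: serve Llama 4 Scout with vLLM,
# sharding the model across the node's eight GPUs via tensor parallelism.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 131072

Once the server is up, vLLM exposes an OpenAI-compatible API (on port 8000 by default) that you can target with standard OpenAI client libraries.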

Added support for DeepSeek models

Earlier this year, DeepSeek released two open-source models: DeepSeek-V3, followed by DeepSeek-R1. The V3 model delivers innovations and optimizations based on an MoE architecture. The R1 model provides reasoning capabilities through a chain-of-thought thinking process.

To help simplify deployment and resource management, we’re releasing new recipes for serving DeepSeek models on Google Cloud Trillium TPUs and A3 Mega and A3 Ultra GPUs.

  • JetStream now supports DeepSeek-R1-Distill-Llama-70B inference on Trillium. A new recipe provides the steps to deploy DeepSeek-R1-Distill-Llama-70B using JetStream and MaxText on a Trillium TPU VM.

  • With the recently added ability to run on Google Cloud TPUs, vLLM users can leverage the performance and cost benefits of TPUs with a few configuration changes. vLLM on TPU now supports all DeepSeek R1 distilled models on Trillium. Here’s a recipe that demonstrates how to use vLLM, a high-throughput inference engine, to serve the DeepSeek distilled Llama model on Trillium TPUs; a minimal invocation is sketched just after this list.

  • You can also deploy DeepSeek models using the SGLang inference stack on our A3 Ultra VMs, powered by eight NVIDIA H200 GPUs, with this recipe. A recipe for A3 Mega VMs with SGLang is also available, which shows you how to deploy multi-host inference across two A3 Mega nodes. Cloud GPU users of the vLLM inference engine can also deploy DeepSeek models on A3 Mega (recipe) and A3 Ultra (recipe) VMs.

  • MaxText now also includes support for architectural innovations from DeepSeek, such as Multi-head Latent Attention (MLA), shared and routed MoE experts with loss-free load balancing, dropless expert parallelism, mixed decoder layers (dense and MoE), and YaRN RoPE embeddings. The reference implementations for the DeepSeek family of models allow you to rapidly experiment with your own models by incorporating some of these newer architectural enhancements.
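For the TPU path, here is a minimal, hedged sketch of serving the distilled 70B model with vLLM on a single Trillium (v6e) host; the flag values are illustrative assumptions, and the linked recipe pins the verified container image, parallelism settings, and context length:

# Hedged sketch, not the recipe's verified command: serve the DeepSeek R1 distilled
# Llama 70B model with vLLM on TPU, sharding across the host's eight v6e chips.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --tensor-parallel-size 8 \
  --max-model-len 4096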

Recipe example

The reproducible recipes show the steps to deploy and benchmark inference with the new Llama4 and DeepSeek models. For example, this TPU recipe outlines the steps to deploy the Llama-4-Scout-17B-16E model with the JetStream MaxText engine on Trillium TPUs. The recipe shows the steps to provision the TPU cluster, download the model weights, and set up JetStream and MaxText. It then shows you how to convert the checkpoint to a format compatible with MaxText, deploy it on a JetStream server, and run your benchmarks.

Typical recipe outline:

  1. Ensure prerequisites are met.

  2. Set up the development environment.

  3. Provision a GKE cluster with a Trillium TPU node pool and a CPU node pool (a hedged sketch of this step follows the outline).

  4. Create a container image with the dependencies.

  5. Convert the checkpoint:

    • Download the model weights from Hugging Face.

    • Convert the checkpoint from Hugging Face format to JAX Orbax format.

    • Unscan the checkpoint for performant serving.

  6. Deploy JetStream and Pathways (for multi-host serving).

  7. Run the MMLU benchmark.
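As a hedged illustration of step 3, the commands below sketch creating a GKE cluster and a single-host Trillium node pool; the cluster name, region, zone, and machine type are assumptions, and the recipe provisions the cluster with its own scripts and exact flags:

# Hedged sketch of step 3, not the recipe's verified commands.
# The cluster name, region, zone, and machine type are illustrative assumptions.
gcloud container clusters create llama4-inference \
  --region=us-east5 \
  --release-channel=rapid

# Add a single-host Trillium (TPU v6e, 8 chips) node pool for serving.
gcloud container node-pools create trillium-pool \
  --cluster=llama4-inference \
  --region=us-east5 \
  --node-locations=us-east5-b \
  --machine-type=ct6e-standard-8t \
  --num-nodes=1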

Bring up the Llama4 server with the JetStream engine using the following configuration:

python3 -m MaxText.maxengine_server \
  /maxtext/MaxText/configs/base.yml \
  scan_layers=false \
  model_name=llama4-17b-16e \
  weight_dtype=bfloat16 \
  base_output_directory=$BASE_OUTPUT_PATH \
  run_name=serving-run \
  load_parameters_path=$CHECKPOINT_TPU_UNSCANNED \
  sparse_matmul=false \
  ici_tensor_parallelism=8 \
  max_prefill_predict_length=1024 \
  force_unroll=false \
  max_target_length=2048 \
  hf_access_token=$HF_TOKEN

You can run various benchmarks against this server. For example, to run MMLU, use the JetStream benchmarking script as follows:

JAX_PLATFORMS=tpu python3 /JetStream/benchmarks/benchmark_serving.py \
  --tokenizer meta-llama/Llama-4-Scout-17B-16E \
  --use-hf-tokenizer 1 \
  --hf-access-token $HF_TOKEN \
  --num-prompts 14037 \
  --dataset mmlu \
  --dataset-path $MMLU_DATASET_PATH \
  --request-rate 0 \
  --warmup-mode sampled \
  --save-request-outputs \
  --num-shots=5 \
  --run-eval True \
  --model=llama4-17b-16e \
  --save-result \
  --request-outputs-file-path mmlu_outputs.json

Build with us

You can deploy the Llama4 Scout and Maverick models or the DeepSeek-V3/R1 models today using inference recipes from the AI Hypercomputer GitHub repository. These recipes provide a starting point for deploying and experimenting with Llama4 and DeepSeek models on Google Cloud. Explore the recipes and resources linked below, and stay tuned for future updates. We hope you have fun building and share your feedback!

When you deploy open models like DeepSeek and Llama, you are responsible for their security and legal compliance. You should follow responsible AI best practices, adhere to each model’s specific licensing terms, and ensure your deployment is secure and compliant with all regulations in your area.

| Model | Accelerator | Framework | Inference recipe link |
| --- | --- | --- | --- |
| Llama-4-Scout-17B-16E | Trillium (TPU v6e) | JetStream MaxText | Recipe |
| Llama-4-Maverick-17B-128E | Trillium (TPU v6e) | JetStream MaxText + Pathways on Cloud | Recipe |
| Llama-4-Scout-17B-16E, Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E, Llama-4-Maverick-17B-128E-Instruct | A3 Ultra (8xH200) | vLLM | Recipe |
| Llama-4-Scout-17B-16E, Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E, Llama-4-Maverick-17B-128E-Instruct | A3 Mega (8xH100) | vLLM | Recipe |

| Model | Accelerator | Framework | Inference recipe link |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Llama-70B | Trillium (TPU v6e) | JetStream MaxText | TPU VM recipe; GKE + TPU recipe |
| DeepSeek-R1-Distill-Llama-70B | Trillium (TPU v6e) | vLLM | Recipe |
| DeepSeek R1 671B | A3 Ultra (8xH200) | vLLM | Recipe |
| DeepSeek R1 671B | A3 Ultra (8xH200) | SGLang | Recipe |
| DeepSeek R1 671B | A3 Mega (16xH100) | vLLM | Recipe |
| DeepSeek R1 671B | A3 Mega (16xH100) | SGLang | Recipe |