High-performance inference meets serverless compute with NVIDIA RTX PRO 6000 on Cloud Run

Running large-scale inference models can involve significant operational toil, including cluster management and manual VM maintenance. One solution is to leverage a serverless compute platform to abstract away the underlying infrastructure. Today, we’re bringing the serverless experience to high-end inference with support for NVIDIA RTX PRO™ 6000 Blackwell Server Edition GPUs on Cloud Run. Now in preview, you can deploy massive models like Gemma 3 27B or Llama 3.1 70B with the ‘deploy and forget’ experience you’ve come to expect from Cloud Run. No reservations. No cluster management. Just code.

A powerful GPU platform

The NVIDIA RTX PRO 6000 Blackwell GPU provides a huge leap in performance compared to the NVIDIA L4 GPU, bringing 96 GB of vGPU memory, 1.6 TB/s of memory bandwidth, and support for FP4 and FP6 precision. This means you can serve models with 70B+ parameters without having to manage any underlying infrastructure. Cloud Run lets you attach an NVIDIA RTX PRO 6000 Blackwell GPU to your Cloud Run service, job, or worker pool, on demand, with no reservations required. Here are some ways you can use the NVIDIA RTX PRO 6000 Blackwell GPU to accelerate your business:

  • Generative AI and inference: With its FP4 precision support, the NVIDIA RTX PRO 6000 Blackwell GPU’s high-efficiency compute accelerates LLM fine-tuning and inference, letting you build real-time generative AI applications such as multi-modal and text-to-image models. By running your model on Cloud Run services, you can also take advantage of rapid startup and scaling, going from zero instances to having a GPU with drivers installed in under 5 seconds. When traffic eventually drops to zero and no more requests are being received, Cloud Run automatically scales your GPU instances down to zero.
  • Fine-tuning and offline inference: Pair NVIDIA RTX PRO 6000 Blackwell GPUs with Cloud Run jobs to fine-tune your models (see the sketch after this list). Their fifth-generation NVIDIA Tensor Cores can also work alongside AI models to accelerate rendering pipelines and enhance content creation.
  • Tailored scaling for specialized workloads: Use GPU-enabled worker pools to apply granular control over your GPU workers, whether you need to dynamically scale based on custom external metrics or manually provision “always-on” instances for complex, stateful processing.
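
To make the jobs path concrete, here is a minimal sketch of creating a GPU-backed fine-tuning job. The job name, container image, and resource values are placeholders, and the GPU flags mirror the service deploy command shown later in this post; check the Cloud Run documentation for the exact flags supported for jobs in this preview.

# Illustrative sketch only: create a Cloud Run job with an RTX PRO 6000 GPU attached.
# "my-finetune-job" and the image path are placeholders; GPU flag names follow the
# service deploy command later in this post and may differ for jobs in preview.
gcloud beta run jobs create my-finetune-job \
  --image us-docker.pkg.dev/my-project/my-repo/finetune:latest \
  --region us-central1 \
  --cpu 20 --memory 80Gi \
  --gpu 1 --gpu-type nvidia-rtx-pro-6000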

We built Cloud Run to be the simplest way to run production-ready, GPU-accelerated tasks. Some highlights of Cloud Run include: 

  • Managed GPUs with flexible compute: Cloud Run pre-installs the necessary NVIDIA drivers so you can focus on your code. Cloud Run instances using NVIDIA RTX PRO 6000 Blackwell GPUs can be configured with up to 44 vCPUs and 176 GB of RAM.

  • Production-grade reliability: By default, Cloud Run offers zonal redundancy, helping to ensure enough capacity for your service to be resilient to a zonal outage; this also applies to Cloud Run with GPUs. Alternatively, you can turn off zonal redundancy and benefit from a lower price for best-effort failover of your GPU workloads in case of a zonal outage.

  • Tight integration: Cloud Run works natively with the rest of Google Cloud. You can load massive model weights by mounting Cloud Storage buckets as local volumes (see the example below), or use Identity-Aware Proxy (IAP) to secure traffic bound for a Cloud Run service.
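
As an example, here is one way the Cloud Storage volume mount might look when loading model weights into a running service. The bucket name, volume name, and mount path below are placeholders for your own deployment.

# Illustrative only: mount a Cloud Storage bucket holding model weights into the
# container at /models. Bucket, volume name, and mount path are placeholders.
gcloud run services update my-service \
  --region us-central1 \
  --add-volume name=model-weights,type=cloud-storage,bucket=my-models-bucket \
  --add-volume-mount volume=model-weights,mount-path=/models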

Get started

The NVIDIA RTX PRO 6000 Blackwell GPU is available in preview on demand in us-central1 and europe-west4, with limited availability in asia-south2 and asia-southeast1. You can deploy your first service using Ollama, one of the easiest ways to run open models, on Cloud Run with NVIDIA RTX PRO 6000 GPUs enabled:

gcloud beta run deploy my-service \
  --image ollama/ollama \
  --port 11434 \
  --cpu 20 \
  --memory 80Gi \
  --gpu-type nvidia-rtx-pro-6000 \
  --no-gpu-zonal-redundancy \
  --region us-central1
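
Once the service is deployed, you can talk to it through Ollama’s HTTP API. The snippet below is a rough sketch that assumes the service requires authenticated invocations and uses "gemma3:27b" as an example model tag; adjust the service name, region, and model to match your deployment.

# Look up the deployed service URL (service name and region from the command above).
SERVICE_URL=$(gcloud run services describe my-service \
  --region us-central1 --format 'value(status.url)')

# Pull an example model, then send a prompt through Ollama's REST API.
# The model tag is a placeholder; pick any open model that fits in GPU memory.
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  "$SERVICE_URL/api/pull" -d '{"name": "gemma3:27b"}'
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  "$SERVICE_URL/api/generate" \
  -d '{"model": "gemma3:27b", "prompt": "Why is the sky blue?"}'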

For more details, check out our updated Cloud Run documentation and AI inference best practices.