The age of AI is here, and organizations everywhere are racing to deploy powerful models to drive innovation, enhance products, and create entirely new user experiences. But moving from a trained model in a lab to a scalable, cost-effective, and production-grade inference service is a significant engineering challenge. It requires deep expertise in infrastructure, networking, security, and the full spectrum of operations practices (MLOps, LLMOps, DevOps, and more).
Today, we’re making it dramatically simpler. We’re excited to announce the GKE inference reference architecture: a comprehensive, production-ready blueprint for deploying your inference workloads on Google Kubernetes Engine (GKE).
This isn’t just another guide; it’s an actionable, automated, and opinionated framework designed to give you the best of GKE for inference, right out of the box.
Start with a strong foundation: The GKE base platform
Before you can run, you need a solid place to stand. This reference architecture is built on the GKE base platform. Think of this as the core, foundational layer that provides a streamlined and secure setup for any accelerated workload on GKE.
Built on infrastructure-as-code (IaC) principles using Terraform, the base platform establishes a robust foundation with the following:
- Automated, repeatable deployments: Define your entire infrastructure as code for consistency and version control.
- Built-in scalability and high availability: Get a configuration that inherently supports autoscaling and is resilient to failures.
- Security best practices: Implement critical security measures like private clusters, Shielded GKE Nodes, and secure artifact management from the start.
- Integrated observability: Seamlessly connect to Google Cloud Observability for deep visibility into your infrastructure and applications.
Starting with this standardized base ensures you’re building on a secure, scalable, and manageable footing, accelerating your path to production.
Why the inference-optimized platform?
The base platform provides the foundation, and the GKE inference reference architecture is the specialized, high-performance engine that’s built on top of it. It’s an extension that’s tailored specifically to solve the unique challenges of serving machine learning models.
Here’s why you should start with our accelerated platform for your AI inference workloads:
1. Optimized for performance and cost
Inference is a balancing act between latency, throughput, and cost. This architecture is fine-tuned to master that balance.
- Intelligent accelerator use: It streamlines the use of GPUs and TPUs: custom compute classes ensure that your pods land on the exact hardware they need, and with node auto-provisioning (NAP), the cluster automatically provisions the right resources when you need them.
- Smarter scaling: Go beyond basic CPU and memory scaling. We integrate a custom metrics adapter that allows the Horizontal Pod Autoscaler (HPA) to scale your models based on real-world inference metrics like queries per second (QPS) or latency, ensuring you only pay for what you use (see the sketch after this list).
- Faster model loading: Large models mean large container images. We leverage the Container File System API and Image streaming in GKE along with Cloud Storage FUSE to dramatically reduce pod startup times. Your containers can start while the model data streams in the background, minimizing cold-start latency.
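To make the scaling point concrete, here is a minimal sketch of a HorizontalPodAutoscaler driven by a per-pod custom metric. The Deployment name, metric name, and target value are illustrative assumptions, not values from the reference architecture; the metric you actually scale on depends on the metrics adapter and model server you deploy.

```yaml
# Illustrative only: assumes a Deployment named "inference-server" and a
# custom metrics adapter that exposes a per-pod queue-depth metric under
# the name shown below. Adjust names and targets for your model server.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth   # hypothetical metric exported by the model server
      target:
        type: AverageValue
        averageValue: "10"
```

In this pattern, the custom metrics adapter surfaces the model server's own signals (such as queue depth or request latency) to the HPA, so replica counts track inference demand rather than raw CPU usage.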
2. Built to scale any inference pattern
Whether you’re doing real-time fraud detection, batch processing analytics, or serving a massive frontier model, this architecture is designed to handle it. It provides a framework for the following:
- Real-time (online) inference: Prioritizes low-latency responses for interactive applications.
- Batch (offline) inference: Efficiently processes large volumes of data for non-time-sensitive tasks.
- Streaming inference: Continuously processes data as it arrives from sources like Pub/Sub.
The architecture leverages GKE features like the cluster autoscaler and the Gateway API for flexible, advanced traffic management that can gracefully handle massive request volumes.
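As an illustration of the Gateway API pattern, the manifests below expose a hypothetical inference Service through a GKE-managed load balancer. The GatewayClass name, Service name, and port are assumptions for this sketch; consult the repository's manifests for the architecture's actual configuration.

```yaml
# Minimal sketch: routes external traffic to a hypothetical
# "inference-server" Service via the Gateway API.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: gke-l7-regional-external-managed  # assumed GKE GatewayClass
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    - name: inference-server   # hypothetical backend Service
      port: 8000
```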
3. Simplified operations for complex models
We’ve baked in features to abstract away the complexity of serving modern AI models, especially LLMs. The architecture includes guidance and integrations for advanced model optimization techniques such as quantization (INT8/INT4), tensor and pipeline parallelism, and attention and KV cache optimizations like PagedAttention and FlashAttention.
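To show what those optimization knobs can look like in practice, here is a hedged sketch of a Deployment for a vLLM-style model server configured for tensor parallelism and a quantized checkpoint. The image, model path, flag values, and GPU count are placeholders, not settings prescribed by the reference architecture.

```yaml
# Illustrative container spec only: flag names follow vLLM's CLI, but the
# image, model, and values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest     # placeholder image
        args:
        - --model=/models/my-model         # placeholder model path
        - --tensor-parallel-size=2         # shard the model across 2 GPUs
        - --quantization=awq               # serve a quantized (e.g., INT4 AWQ) checkpoint
        resources:
          limits:
            nvidia.com/gpu: 2
```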
Furthermore, with GKE in Autopilot mode, you can offload node management entirely to Google and focus on your models, not your infrastructure.
Get started today!
Ready to build your inference platform on GKE? The GKE inference reference architecture is available today in the Google Cloud Accelerated Platforms GitHub repository. The repository contains everything that you need to get started, including the Terraform code, documentation, and example use cases.
We’ve included examples for deploying popular workloads like ComfyUI, as well as general-purpose online inference with GPUs and TPUs, to help you get started quickly.
By combining the rock-solid foundation of the GKE base platform with the performance and operational enhancements of the inference reference architecture, you can deploy your AI workloads with confidence, speed, and efficiency. Stop reinventing the wheel and start building the future on GKE.
The future of AI on GKE
The GKE inference reference architecture is more than just a collection of tools; it’s a reflection of Google’s commitment to making GKE the best platform for running your inference workloads. By providing a clear, opinionated, and extensible architecture, we are empowering you to accelerate your AI journey and bring your innovative ideas to life.
We’re excited to see what you’ll build with the GKE inference reference architecture. Your feedback is welcome! Please share your thoughts in the GitHub repository.