GKE network interface at 10: From core connectivity to the AI backbone

It’s hard to believe it’s been over 10 years since Kubernetes first set sail, fundamentally changing how we build, deploy, and manage applications. Google Cloud was at the forefront of the Kubernetes revolution with Google Kubernetes Engine (GKE), providing a robust, scalable, and cutting-edge platform for your containerized workloads. Since then, Kubernetes has emerged as the preferred platform for workloads such as AI/ML.  

Kubernetes is all about sharing machine resources among applications, and pod networking, implemented through the Container Network Interface (CNI), is essential for connectivity between workloads and services.

As we celebrate GKE’s 10th anniversary, let’s take a look at how we’ve built out network interfaces to provide performant, secure, and flexible pod networking, and how we’ve evolved our networking model to support AI workloads with the Kubernetes Network Driver.

Let’s take a look back in time and see how we got there.

2015-2017: Laying the CNI foundation with kubenet

In Kubernetes’s early days, we needed to establish reliable communication between containers. For GKE, we adopted a flat model of IP addressing so that pods and nodes could communicate freely with other resources in the Virtual Private Cloud (VPC) without going through tunnels and gateways. During these formative years, GKE’s early networking models often used kubenet, a basic network plugin. Kubenet provided a straightforward way to get clusters up and running by creating a bridge on each node and allocating IP addresses to pods from a CIDR range dedicated to that node. While Google Cloud’s network handled routing between nodes, kubenet was responsible for connecting pods to the node’s network and enabling basic pod-to-pod communication within the node.
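
To make the per-node allocation concrete, here is an illustrative excerpt of a Node object under this model; the fields are the standard Kubernetes Node API, while the node name and CIDR values are example placeholders rather than anything from a real cluster.

```yaml
# Illustrative Node excerpt: with kubenet-style allocation, each node is
# assigned its own pod CIDR, and the node-local bridge hands out pod IPs
# from that range. Node name and addresses are example values.
apiVersion: v1
kind: Node
metadata:
  name: gke-example-pool-node-1
spec:
  podCIDR: 10.4.1.0/24     # per-node range used for pod IPs on this node
  podCIDRs:
  - 10.4.1.0/24
```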

During this time, we also introduced route-based clusters, which were built on Google Cloud routes, part of the Andromeda engine that powers all of Google Cloud networking. The routes feature in Andromeda played a crucial role in IP address allocation and routing within the cluster network using VPC routing rules, which required advertising pod IP ranges between the nodes.

However, as Kubernetes adoption exploded and applications grew in scale and complexity, we faced challenges around managing IP addresses and achieving high-performance communication directly between pods across different parts of a VPC. This pushed us to develop a networking solution that was more deeply integrated with the underlying cloud network.

2018-2019: Embracing VPC-native networking

To address these evolving needs and integrate with Google Cloud’s powerful networking capabilities, we introduced VPC-native networking for GKE. This marked a significant leap forward for how CNI operates in GKE, with alias IP ranges (the IP ranges that pods use on a node) becoming a cornerstone of the solution. VPC-native networking became the default and recommended approach, helping GKE clusters scale to up to 15,000 nodes. With VPC-native clusters, the GKE CNI plugin ensures that pods receive IP addresses directly from your VPC network, making them first-class citizens on your network.

This shift brought a multitude of benefits:

  • Simplified IP management: The GKE CNI plugin allocates pod IPs directly from the VPC, making them directly routable and easier to manage alongside your other cloud resources.

  • Enhanced security through VPC integration: Because pod IPs are VPC-native, you can apply VPC firewall rules directly to pods. 

  • Improved performance and scalability: GKE CNI plugin facilitates direct routing within the VPC, reducing overhead and improving network throughput for pod traffic.

  • A foundation for advanced CNI features: VPC-native networking laid the groundwork for more sophisticated CNI functionalities that followed.

We refer to GKE’s CNI implementation with VPC-native networking as GKE standard networking with dataplane v1 (DPv1). During this time, we also announced GA support for network policies with Calico. Network policies let you specify rules for traffic flow within your cluster, as well as between pods and the outside world.
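
As a minimal sketch of what such a policy looks like, the manifest below allows only pods labeled app=frontend to reach pods labeled app=backend on TCP port 8080; the names and labels are illustrative, not from the original post.

```yaml
# Minimal NetworkPolicy sketch: restrict ingress to backend pods so that
# only frontend pods in the same namespace can reach them on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend        # the pods this policy protects
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend   # the only pods allowed to connect
    ports:
    - protocol: TCP
      port: 8080
```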

2020 and beyond: The eBPF revolution

The next major evolution in GKE’s CNI strategy arrived with the power of the extended Berkeley Packet Filter (eBPF), which lets you run sandboxed programs in a privileged context. eBPF makes it safe to program the Linux kernel dynamically, opening up new possibilities for networking, security, and observability at the CNI level without having to recompile the kernel.

Recognizing this potential, Google Cloud embraced Cilium, a leading open-source CNI project built on eBPF, to create GKE Dataplane V2 (DPv2). Reaching general availability in May 2021, GKE Dataplane V2 represented a significant leap in GKE’s CNI capabilities: 

  • Enhanced performance and scalability: By leveraging eBPF, the CNI can bypass traditional kernel networking paths (like kube-proxy’s iptables-based service routing) for highly efficient packet processing for services and network policy.

  • Built-in network policy enforcement: GKE Dataplane V2 comes with Kubernetes network policy enforcement out-of-the-box, meaning you don’t need to install or manage a separate CNI like Calico solely for policy enforcement when using DPv2.

  • Enhanced observability at the data plane layer: eBPF enables deep insights into network flows directly from the kernel. GKE Dataplane V2 provides the foundation for features like network policy logging, offering visibility into CNI-level decisions.

  • Integrated security in the dataplane: eBPF enforces network policies efficiently and with context-awareness directly within CNI’s dataplane.

  • Simplified operations: As a Google-managed CNI component, GKE Dataplane V2 simplifies operations for customer workloads.

  • Advanced networking capabilities: Dataplane V2 unlocks a suite of powerful features that were not available, or were harder to achieve, with Dataplane V1. These include:

    • IPv6 and dual-stack support: Enabling pods and services to operate with both IPv4 and IPv6 addresses (see the Service sketch after this list).

    • Multi-networking: Allowing pods to have multiple network interfaces, connecting to different VPC networks or specialized network attachments, crucial for use cases like cloud native network functions (CNFs) and traffic isolation.

    • Service steering: Providing fine-grained control over traffic flow by directing specific traffic through a chain of service functions (like virtual firewalls or inspection points) within the cluster.

    • Persistent IP addresses for pods: In conjunction with the Gateway API, GKE Dataplane V2 allows pods to retain the same IP address across restarts or rescheduling, which is vital for certain stateful applications or network functions.
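
As a minimal sketch of the dual-stack support mentioned above, the Service below requests both an IPv4 and an IPv6 ClusterIP using the upstream ipFamilies fields; the Service name, selector, and ports are illustrative.

```yaml
# Minimal dual-stack Service sketch: ask Kubernetes for both an IPv4 and an
# IPv6 ClusterIP. Requires a dual-stack cluster (for example, GKE Dataplane V2
# with dual-stack enabled); names and ports are example values.
apiVersion: v1
kind: Service
metadata:
  name: web-dual-stack
spec:
  ipFamilyPolicy: RequireDualStack   # fail if both IP families are not available
  ipFamilies:
  - IPv4
  - IPv6
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
```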

GKE Dataplane V2 is now the default CNI for new clusters in GKE Autopilot mode and our recommended choice for GKE Standard clusters, underscoring our commitment to providing a cutting-edge, eBPF-powered network interface.

2024: Scaling new heights for AI Training and Inference

In 2024, we marked a monumental achievement in GKE’s scalability, with the announcement that GKE supports clusters of up to 65,000 nodes. This incredible feat, a significant jump from previous limits, was made possible in large part by GKE Dataplane V2’s robust, highly optimized architecture. Powering such massive clusters, especially for demanding AI/ML training and inference workloads, requires a dataplane that is not only performant but also incredibly efficient at scale. The version of GKE Dataplane V2 underpinning these 65,000-node clusters is specifically enhanced for extreme scale and the unique performance characteristics of large-scale AI/ML applications — a testament to CNI’s ability to push the boundaries of what’s possible in cloud-native computing.

For AI/ML workloads, GKE Dataplane V2 also supports ever-increasing bandwidth requirements, such as those of our recently announced A4 instance. GKE Dataplane V2 also supports a variety of compute and AI/ML accelerators, including the latest GB200 GPUs and the Ironwood and Trillium TPUs.

For today’s AI/ML workloads, networking plays a critical role: AI and machine learning workloads are pushing the boundaries of computing as well as networking, presenting unique challenges for GKE networking interfaces:

  • Extreme throughput: Training large models requires processing massive datasets that demand upwards of terabit networking orchestrated by GKE networking interfaces.

  • Ultra-low latency: Distributed training relies on near-instantaneous communication between processing units.

  • Multi-NIC capabilities: Providing pods with multiple network interfaces, managed by GKE Dataplane V2’s multi-networking capability, can significantly boost bandwidth and allow for network segmentation; a minimal pod sketch follows this list.
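
The sketch below shows what requesting an additional interface can look like with GKE’s multi-networking pod annotations, based on our reading of that API; the network name (gpu-rdma-net), pod name, and image are hypothetical placeholders, and the exact annotation keys may vary by GKE version.

```yaml
# Sketch of a pod requesting a second NIC via GKE multi-networking
# annotations (annotation keys per our understanding of the GKE API;
# network, pod, and image names are hypothetical).
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
  annotations:
    networking.gke.io/default-interface: "eth0"
    networking.gke.io/interfaces: |
      [
        {"interfaceName": "eth0", "network": "default"},
        {"interfaceName": "eth1", "network": "gpu-rdma-net"}
      ]
spec:
  containers:
  - name: trainer
    image: us-docker.pkg.dev/example/repo/trainer:latest
```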

2025: Beyond CNI, addressing next-gen pod networking challenges

Dynamic resource allocation (DRA) for networking 

A promising Kubernetes innovation is dynamic resource allocation (DRA). Introduced to provide a more flexible and extensible way for workloads to request and consume resources beyond CPU and memory, DRA is poised to significantly impact how CNIs manage and expose network resources. While initially focused on resources like GPUs, its framework is designed for broader applicability.

In GKE, DRA (available in preview from GKE version 1.32.1-gke.1489001+) opens up possibilities for more efficient and tailored network resource management, helping demanding applications get the network performance they need using the Kubernetes Network Drivers (KNDs).

KNDs use DRA to expose network resources at the node level that pods (or individual containers) can reference. This is particularly relevant for AI/ML workloads, which often require very specific networking capabilities.
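
As a minimal sketch of how a workload might consume a network device through DRA (using the Kubernetes resource.k8s.io/v1beta1 API available in 1.32), the manifests below define a ResourceClaimTemplate and a pod that references it; the device class name dranet.example.com, the pod name, and the image are hypothetical placeholders, since the actual class is published by the KND you install.

```yaml
# Minimal DRA sketch (Kubernetes 1.32, resource.k8s.io/v1beta1): a
# ResourceClaimTemplate requests a device from a network device class, and a
# pod consumes it. The device class name and image are hypothetical; a real
# Kubernetes Network Driver such as DRANET publishes its own class.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: nic-claim-template
spec:
  spec:
    devices:
      requests:
      - name: nic
        deviceClassName: dranet.example.com   # hypothetical device class
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  resourceClaims:
  - name: nic
    resourceClaimTemplateName: nic-claim-template
  containers:
  - name: server
    image: us-docker.pkg.dev/example/repo/inference:latest
    resources:
      claims:
      - name: nic   # attach the claimed network device to this container
```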

Looking ahead: Innovations shaping the future

The journey doesn’t stop here. With the increased adoption of accelerated workloads driving new architectures on GKE, the demands on Kubernetes networking will continue to grow. One of the reference implementations of the Kubernetes Network Driver is the DRANET project. We look forward to continued community discussions and contributions to the DRANET project, and we are committed to working together to deliver innovative, customer-centric solutions that address these new challenges.