Scaling to zero on Google Kubernetes Engine with KEDA

For developers and businesses that run applications on Google Kubernetes Engine (GKE), scaling deployments down to zero when they are idle can offer significant financial savings. GKE’s Cluster Autoscaler efficiently manages node pool sizes, but GKE doesn’t natively offer scale-to-zero functionality for workloads: if an application needs to shut down completely when idle (with its node pool scaling all the way to and from zero), you need an alternative. This is especially important for applications with intermittent workloads or highly variable traffic patterns.

In this blog post, we demonstrate how to integrate the open-source Kubernetes Event-driven Autoscaler (KEDA) to achieve this. With KEDA, you can align your costs directly with your needs, paying only for the resources consumed.

Why scale to zero?

Minimizing costs is the primary driver for scaling to zero, and it applies to a wide variety of scenarios. Scale-to-zero is particularly valuable when dealing with:

  • GPU-intensive workloads: AI/ML workloads often require powerful GPUs, which can be expensive to keep running even when idle.

  • Applications with predictable downtime: Internal tools with specific usage hours — scale down resources for applications used only during business hours or specific days of the week.

  • Seasonal applications: Scale to zero during the off-season for applications with predictable periods of low activity.

  • On-demand staging environments: Replicate production environments for testing and validation, scaling them to zero after testing is complete.

  • Development, demo and proof-of-concept environments:
      • Short-term demonstrations: Showcase applications or features to clients or stakeholders, scaling down resources after the demonstration.

      • Temporary proof-of-concept deployments: Test new ideas or technologies in a live environment, scaling to zero after evaluation.

      • Development environment: Spin up resources for testing, code reviews, or feature branches and scale them down to zero when not needed, optimizing costs for temporary workloads.

  • Event-driven applications:
      • Microservices with sporadic traffic: Scale individual services to zero when they are idle and automatically scale them up when requests arrive, optimizing resource utilization for unpredictable traffic patterns.

      • Serverless functions: Execute code in response to events without managing servers, automatically scaling to zero when inactive.

  • Disaster recovery and business continuity: Maintain a minimal set of core resources in a standby state, ready to scale up rapidly in case of a disaster, minimizing costs while ensuring business continuity.

Introducing KEDA for GKE

KEDA is an open-source, Kubernetes-native solution that enables you to scale deployments based on a variety of metrics and events. KEDA can trigger scaling actions based on external events such as message queue depth or incoming HTTP requests. And unlike the current implementation of Horizontal Pod Autoscaler (HPA), KEDA supports scaling workloads to zero, making it a strong choice for handling intermittent jobs or applications with fluctuating demand.

Use cases

Let’s explore two common scenarios where KEDA’s scale-to-zero capabilities are beneficial:

1. Scaling a Pub/Sub worker

  • Scenario: A deployment processes messages from a Pub/Sub topic. When no messages are available, scaling down to zero saves resources and costs.

  • Solution: KEDA’s Pub/Sub scaler monitors the message queue and triggers scaling actions accordingly. By configuring a ScaledObject resource, you can specify that the deployment scales down to zero replicas when the queue is empty.
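
As a rough sketch, a ScaledObject for this pattern might look like the following. The deployment name, subscription name, and threshold values are placeholders, and authentication is assumed to be handled by a separate TriggerAuthentication resource (for example, one using Workload Identity):

  apiVersion: keda.sh/v1alpha1
  kind: ScaledObject
  metadata:
    name: pubsub-worker-scaler        # hypothetical name
  spec:
    scaleTargetRef:
      name: pubsub-worker             # the worker Deployment to scale
    minReplicaCount: 0                # allow scaling all the way to zero
    maxReplicaCount: 10
    cooldownPeriod: 120               # seconds of inactivity before scaling to zero
    triggers:
      - type: gcp-pubsub
        authenticationRef:
          name: keda-trigger-auth-gcp # assumed TriggerAuthentication resource
        metadata:
          subscriptionName: my-subscription  # placeholder subscription
          mode: SubscriptionSize             # scale on undelivered message count
          value: "5"                         # target messages per replica

With minReplicaCount set to 0, KEDA removes the last replica once the subscription has been empty for the cooldown period, and recreates it when messages arrive again.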

2. Scaling a GPU-dependent workload, such as an Ollama deployment for LLM serving

  • Scenario: An Ollama-based large language model (LLM) performs inference tasks. To minimize GPU usage and costs, the deployment needs to scale down to zero when there are no inference requests.

  • Solution: Combining HTTP-KEDA (a beta feature of KEDA) with Ollama enables scale-to-zero functionality. HTTP-KEDA scales deployments based on HTTP request metrics, while Ollama serves the LLM.
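
A minimal HTTPScaledObject sketch for this setup is shown below. Field names have changed across versions of the KEDA HTTP add-on, so treat this as illustrative; the host and resource names are placeholders, while 11434 is Ollama’s default serving port:

  apiVersion: http.keda.sh/v1alpha1
  kind: HTTPScaledObject
  metadata:
    name: ollama
  spec:
    hosts:
      - ollama.example.com      # placeholder host the KEDA interceptor routes on
    scaleTargetRef:
      name: ollama              # the Ollama Deployment
      kind: Deployment
      apiVersion: apps/v1
      service: ollama           # Service fronting the Deployment
      port: 11434               # Ollama's default port
    replicas:
      min: 0                    # release the GPU node when idle
      max: 1

The add-on’s interceptor holds incoming requests while the deployment scales up from zero, so the first request after an idle period sees a cold-start delay rather than a failure.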

Get started with KEDA on GKE

KEDA offers a powerful and flexible solution for achieving scale-to-zero functionality on GKE. By leveraging KEDA’s event-driven scaling capabilities, you can optimize resource utilization, minimize costs, and improve the efficiency of your Kubernetes deployments. Remember to validate your usage scenarios, though, as scaling to zero can affect workload performance: once an application has scaled to zero, no instances are running, so the first incoming request must wait for a new instance to start, and this cold start increases latency.

There are also state-management considerations: when instances are terminated, any in-memory state is lost.

To help you get started quickly, we’ve published a guide for scaling GKE to zero with KEDA, with specific steps for scaling a Pub/Sub worker to zero and for scaling an Ollama deployment to zero with KEDA on GKE.

To learn more about KEDA and its various scalers, refer to the official KEDA documentation at https://keda.sh/docs/latest.