Build large-scale AI/ML and HPC clusters with Cluster Toolkit (formerly HPC Toolkit)

Cluster Toolkit, formerly the Cloud HPC Toolkit, simplifies the creation and management of high-performance computing (HPC) environments on Google Cloud. Initially focused on scientific and technical computing workloads, it has since expanded to cover AI/ML applications, reflecting its adoption well beyond traditional HPC.

The Cluster Toolkit empowers users to focus on their workloads by streamlining cluster setup and deployment, leveraging Google Cloud’s best practices, and offering flexibility for diverse computing tasks. Key benefits include:

  • Easy deployment and management of clusters: The Toolkit simplifies setting up and maintaining clusters, so users spend less time on infrastructure management. It supports multiple schedulers, including Slurm, GKE, and Batch.

  • Quickstart options for HPC and AI/ML workloads: The Toolkit has a library of prebuilt blueprints and modules that let users start running their workloads quickly.

  • Integration of Google Cloud best practices: These blueprints and modules incorporate Google Cloud’s recommended configurations, so clusters are set up for performance and efficiency from the start.

  • Regular updates and new features: The Toolkit is actively maintained and regularly updated with new features and improvements.

  • Open-source accessibility: The Toolkit is open-source, allowing users to customize and extend its capabilities to meet their specific needs.

What’s new in Cluster Toolkit

In addition to a new name, Cluster Toolkit has several new features for HPC and AI/ML workloads:

  • A3 Mega Blueprint: This blueprint makes it easy to deploy a cluster of A3 Mega VMs ready for training large language models (LLMs) and other AI/ML workloads. Earlier in the year, we also launched the A3 Blueprint.

  • HPC VM Image: This VM image comes preinstalled with popular HPC tools and libraries, so you can begin running your HPC workloads quickly with assured performance.

  • Slurm-gcp v6: The latest version of the Slurm-gcp solution, which provides a seamless experience for running Slurm workloads on Google Cloud, is now generally available (GA).

Guidelines for existing Toolkit customers

We’ve renamed our GitHub repo to “Cluster Toolkit” and renamed some commands (for example, ghpc is now gcluster). Existing Git operations and commands will still work, but we strongly recommend updating your local clones and switching to the new command names to avoid confusion.
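
If you have wrapper scripts or CI jobs that still call ghpc, a small helper can update them in bulk. The following is a minimal sketch rather than an official migration tool: the function name rename_command and the blueprint file name my-blueprint.yaml are hypothetical, and it assumes the old command appears as a standalone word (optionally with a path prefix such as ./).

```python
import re

# Match `ghpc` only as a standalone token, so unrelated names such as
# `myghpc` or `ghpc-backup` are left untouched. A path prefix like `./`
# is still allowed, because `/` is neither a word character nor a hyphen.
_OLD_COMMAND = re.compile(r"(?<![\w-])ghpc(?![\w-])")


def rename_command(text: str) -> str:
    """Return `text` with standalone `ghpc` invocations replaced by `gcluster`."""
    return _OLD_COMMAND.sub("gcluster", text)


# Hypothetical script line; the blueprint file name is an assumption.
script = "./ghpc deploy my-blueprint.yaml\n"
print(rename_command(script), end="")  # prints: ./gcluster deploy my-blueprint.yaml
```

Review the resulting diff before committing, since a plain text substitution cannot tell a command invocation apart from a comment or log message that happens to mention ghpc.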

How to get started

To get started with the Cluster Toolkit, select one of our easy-to-use HPC and AI/ML blueprints, available through our GitHub repo, and use it to set up a cluster. We also offer a variety of resources to help you get started, including documentation, quickstarts, and videos.