Fireworks.ai: Lighting up gen AI through a more efficient inference engine

December 3, 2024

By

Google Cloud Blog

Enterprises across industries are investing in AI technologies to move faster, be more productive, and give their customers the products and services that they need. But moving AI from prototype to production isn’t easy. That’s why we created Fireworks AI.

The story of Fireworks AI started seven years ago at Meta AI, where a group of innovators worked on PyTorch — an ambitious project building leading AI infrastructure from scratch. Today, PyTorch is one of the most popular open-source AI frameworks, serving trillions of inferences daily.

Many companies building AI products struggle to balance total cost of ownership (TCO) with performance quality and inference speed, while transitions from prototype to production can also be challenging. Leaders at PyTorch saw a tremendous opportunity to use their years of experience to help companies solve this challenge. And so, Fireworks AI was born.

Fireworks AI delivers the fastest and most efficient gen AI inference engine to date. We’re pushing the boundaries with compound AI systems, which replace more traditional single AI models with multiple interacting models. Think of a voice-based search application that uses audio recognition models to transcribe questions and language models to answer them.
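To make the voice-search example concrete, here is a minimal sketch of a compound AI system in Python. It is an illustration only: `transcribe`, `answer`, and `VoiceSearch` are hypothetical stand-ins rather than Fireworks AI APIs, and both "models" are stubs so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical stub for an audio recognition model (not a real API).
# It simply treats the audio bytes as UTF-8 text.
def transcribe(audio: bytes) -> str:
    return audio.decode("utf-8").strip()


# Hypothetical stub for a language model (not a real API).
# It looks up a canned answer instead of generating text.
def answer(question: str) -> str:
    canned = {"what is pytorch?": "PyTorch is an open-source AI framework."}
    return canned.get(question.lower(), "I don't know.")


@dataclass
class VoiceSearch:
    """A compound system: two interacting models behind one entry point."""

    speech_to_text: Callable[[bytes], str]
    language_model: Callable[[str], str]

    def run(self, audio: bytes) -> str:
        # Step 1: the audio model turns speech into a text question.
        question = self.speech_to_text(audio)
        # Step 2: the language model answers the transcribed question.
        return self.language_model(question)


pipeline = VoiceSearch(speech_to_text=transcribe, language_model=answer)
result = pipeline.run(b"What is PyTorch?\n")
print(result)
```

Because each stage is just a callable, either model could be swapped for a different one, or placed on different hardware, without changing the code that calls `run`.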

With support from partners like NVIDIA and their incredible CUDA and CUTLASS libraries, we’re evolving fast so companies can start taking their next big steps into gen AI.

Here’s how we work with Google Cloud to tackle the scale, cost, and complexity challenges of gen AI.

Matching customer growth with scale

Scale is a primary concern when moving into production, because AI moves fast. Fireworks’ customers might develop new models that they want to roll out right away or find that their demand has doubled overnight, so we need to be able to scale immediately.

While we’re building state-of-the-art infrastructure software for gen AI, we look to top partners to provide architectural components for our customers. Google Cloud’s engineering strength provides an incredible environment for performance, reliability, and scalability. It’s designed to handle high-volume workloads while maintaining excellent uptime. Currently, Fireworks processes over 140 billion tokens daily with 99.99% API uptime, so our customers rarely experience interruptions.
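For a sense of scale, the two figures above can be turned into per-second and per-year numbers. The short Python sketch below is back-of-the-envelope arithmetic only: it assumes traffic is spread evenly across the day and a 365-day year, neither of which is stated in the figures themselves.

```python
TOKENS_PER_DAY = 140e9   # "over 140 billion tokens daily"
UPTIME = 0.9999          # "99.99% API uptime"

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year

# Average throughput if load were perfectly even (real traffic has peaks).
tokens_per_second = TOKENS_PER_DAY / SECONDS_PER_DAY

# Downtime that a 99.99% uptime figure still allows over one year.
downtime_minutes_per_year = (1 - UPTIME) * MINUTES_PER_YEAR

print(f"{tokens_per_second:,.0f} tokens per second on average")
print(f"{downtime_minutes_per_year:.1f} minutes of downtime allowed per year")
```

That works out to roughly 1.6 million tokens per second on average, and under an hour of allowed downtime per year.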

Google Kubernetes Engine (GKE) and Compute Engine are also essential to our environment, helping us run control plane APIs and manage the fleet of GPUs.

Google Cloud offers us outstanding scalability, so we only ever use right-sized infrastructure. When customers need to scale, we can instantly meet their requests.

Since Fireworks is a member of the Google for Startups program, Google Cloud provided us with credits that were essential for growing our operations.

Stopping runaway costs of AI

Scale isn’t the only thing companies need to worry about. Costs can balloon overnight after deploying AI, and enterprises need efficient ways to scale to maintain sustainable growth. By analyzing performance and environments, Fireworks can help them balance scale and efficiency.

We use Cloud Pub/Sub and Cloud Functions to process reporting and billing events, and Cloud Monitoring for log analytics and alerting metrics. All the request and billing data is then stored in BigQuery, where we can analyze usage and volumes for each customer model. This helps us determine if we have extra capacity, if we need to scale, and by how much.
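The kind of per-model analysis described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not our production pipeline: the event fields (`customer`, `model`, `tokens`), the capacity figures, and the 80%/30% thresholds are all hypothetical, and in practice the aggregation would run as a query over the stored billing data rather than over an in-memory list.

```python
from collections import defaultdict

# Hypothetical billing events; field names and values are illustrative.
events = [
    {"customer": "acme", "model": "model-a", "tokens": 1200},
    {"customer": "acme", "model": "model-a", "tokens": 800},
    {"customer": "acme", "model": "model-b", "tokens": 500},
    {"customer": "globex", "model": "model-a", "tokens": 300},
]


def tokens_per_model(events):
    """Sum token volume for each (customer, model) pair."""
    totals = defaultdict(int)
    for event in events:
        totals[(event["customer"], event["model"])] += event["tokens"]
    return dict(totals)


def capacity_action(used_tokens, capacity_tokens, high=0.8, low=0.3):
    """Suggest an action from utilization; thresholds are assumptions."""
    if capacity_tokens <= 0:
        raise ValueError("capacity_tokens must be positive")
    utilization = used_tokens / capacity_tokens
    if utilization > high:
        return "scale_up"
    if utilization < low:
        return "scale_down"
    return "hold"


totals = tokens_per_model(events)
for (customer, model), used in sorted(totals.items()):
    # Assume every deployment is provisioned for 2,200 tokens (hypothetical).
    print(customer, model, used, capacity_action(used, 2200))
```

Grouping by customer and model first is what makes it possible to answer all three questions: whether there is spare capacity, whether to scale, and by how much.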

Google Cloud’s blue-chip cloud environment also allows us to provide more to our customers without breaking budgets. Because we can offer 4X lower latency and 4X higher throughput compared to competing hosted services, we provide better performance at lower prices. Customers don’t need to grow their budgets to increase performance, which keeps TCO down.

The right environment for any customer

Every gen AI solution has its own complexities and nuances, so we need to remain flexible to tailor the environment for each customer. Some enterprises might need different GPUs for different parts of a compound AI system, or they might want to deploy smaller fine-tuned models alongside larger models. Google Cloud gives us the freedom to split up tasks and use any GPUs that we need, as well as integrate with a diverse range of models and environments.

This is especially important when it comes to data privacy and security concerns for customers in sensitive industries such as finance and healthcare. Google Cloud provides robust security features like encryption and secure VPC connectivity, and it helps us meet compliance requirements such as HIPAA and SOC 2.

Meeting our customers where they are, which is a moving target, is critical to our success in gen AI. Companies like Google Cloud and NVIDIA help us do just that.

Powering innovation in gen AI

Our philosophy is that enterprises of all sizes should be able to experiment with and build AI products. AI is a powerful technology that can transform industries and help businesses compete on a global scale.

Keeping AI open source and accessible is paramount, and that’s one of the reasons we continue to work with Google Cloud. With Google Cloud, we can enable more companies to drive value from innovative uses of gen AI.