From data lakes to user applications: How Bigtable works with Apache Iceberg

The latest version of the Bigtable Spark connector opens up a world of possibilities for Bigtable and Apache Spark applications, not least of which is expanded support for Apache Iceberg, the open table format for large analytical datasets. In this blog post, we explore how to use the Bigtable Spark connector to interact with data stored in Bigtable from Apache Spark, and delve into powerful use cases that leverage Apache Iceberg.

The Bigtable Spark connector allows you to directly read and write Bigtable data using Apache Spark in Scala, SparkSQL, and DataFrames. This integration gives you direct access to your operational data for building data pipelines that support training ML models, ETL/ELT, or generating real-time dashboards. When combined with Bigtable Data Boost, Bigtable’s serverless compute service, you can run high-throughput read jobs on operational data without impacting Bigtable application performance. Apache Spark is commonly used as a processing engine for working with data lakehouses and data stored in open table formats, including Apache Iceberg. We’ve enhanced the Bigtable Spark connector for working with data across both Bigtable and Iceberg, including query optimizations such as join pushdowns and support for dynamic column filtering.

This opens up Bigtable and Apache Iceberg integrations for:

  • Accelerated data science: In the past, Bigtable developers and administrators had to generate datasets for analytics and move them out of Bigtable for analytical processing in tools like notebooks and PySpark. Now, data scientists can directly interact with Bigtable’s operational data within their Apache Spark environments using a combination of both Bigtable and Apache Iceberg data, streamlining data preparation, exploration, analysis, and even the creation of Iceberg tables. When combined with Data Boost, this can be done without any impact to production applications. 

  • Low-latency serving: Write-back capabilities support making real-time updates to Bigtable. This means you can use Iceberg data to create predictions or features in batch and easily serve those features from Bigtable for low-latency online access within an end-user application. 

To get started, you’ll need to add the Bigtable Spark connector dependency to your Apache Spark instance. Next, create a JSON catalog that maps the Spark DataFrame schema to the Bigtable row key and column families. Once this catalog is established, you can read data from Bigtable as a Spark DataFrame with a simple command:

    records = spark.read \
        .format('bigtable') \
        .option('spark.bigtable.project.id', bigtable_project_id) \
        .option('spark.bigtable.instance.id', bigtable_instance_id) \
        .options(catalog=catalog) \
        .load()
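
The catalog variable used here and in the examples below is a JSON document that describes the table name, row key, and column family mappings. As a rough sketch with hypothetical table and column names, it might look like this:

    import json

    # Hypothetical schema: a telemetry table keyed by vehicle ID, with metrics in one column family.
    catalog = json.dumps({
        "table": {"name": "vehicle_telemetry"},
        "rowkey": "vehicle_id",
        "columns": {
            "vehicle_id": {"cf": "rowkey", "col": "vehicle_id", "type": "string"},
            "speed": {"cf": "metrics", "col": "speed", "type": "double"},
            "engine_rpm": {"cf": "metrics", "col": "engine_rpm", "type": "long"}
        }
    })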

A write can also be performed directly from an Apache Spark DataFrame object using the following command:

    input_data = spark.createDataFrame(data)

    input_data.write \
        .format('bigtable') \
        .options(catalog=catalog) \
        .option('spark.bigtable.project.id', bigtable_project_id) \
        .option('spark.bigtable.instance.id', bigtable_instance_id) \
        .option('spark.bigtable.create.new.table', create_new_table) \
        .save()

To get started, follow the Quickstart or read on to learn more about the two use cases outlined above.  

What the Bigtable Spark connector can do for you

Now, let’s take a look at some ways you could put the Bigtable Spark connector into service.

Accelerated data science

Bigtable is designed for throughput-intensive applications, offering throughput that can be adjusted by adding and removing nodes. If you are writing in batch over the Apache Spark connector, you can achieve even more throughput by using the spark.bigtable.batch.mutate.size option, which takes advantage of Bigtable’s mutation batching functionality.
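
For example, a batch write can set this option alongside the usual connector options; the batch size below is purely illustrative, so tune it for your workload and keep it within the connector's documented limits:

    input_data.write \
        .format('bigtable') \
        .options(catalog=catalog) \
        .option('spark.bigtable.project.id', bigtable_project_id) \
        .option('spark.bigtable.instance.id', bigtable_instance_id) \
        .option('spark.bigtable.batch.mutate.size', '1000') \
        .save()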

Throughput and queries per second (QPS) can be autoscaled and resized without restarts, and data is automatically replicated for high availability and faster region-specific access. There are also specialized data types that make it easy to build distributed counters, which can give you up-to-date metrics on what is happening in your system.

Apache Iceberg, meanwhile, is a high-performance, open-source table format for large analytical datasets. Iceberg lets you build analytics tables, often with aggregated data, that can be shared across engines such as Apache Spark and BigQuery.

Customers have found that collecting events in Bigtable and analyzing those events with Apache Spark and Apache Iceberg is a powerful combination. For example, you may want to collect clicks, views, sensor readings, device usage, gaming activity, engagement, or other telemetry in real time, and maintain a view of what is happening in the system using Bigtable’s continuous materialized views. You might then use Apache Spark’s batch processing and ML capabilities, and even join with historical Iceberg data, to run advanced analytics that reveal trends over time, identify anomalies, or generate machine learning models on the data. When these advanced analytics in Apache Spark are run using a Data Boost application profile, the analysis can be done without impacting real-time data collection and operational analytics.
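
A minimal sketch of that pattern might look like the following. The Data Boost application profile ID, the Iceberg table name, and the join columns are placeholders, and the catalog is the same kind of JSON mapping shown earlier:

    # Read live events from Bigtable through a Data Boost application profile so the
    # batch job doesn't compete with the serving workload.
    events = spark.read \
        .format('bigtable') \
        .option('spark.bigtable.project.id', bigtable_project_id) \
        .option('spark.bigtable.instance.id', bigtable_instance_id) \
        .option('spark.bigtable.app.profile.id', 'data-boost-profile') \
        .options(catalog=catalog) \
        .load()

    # Join with historical data kept in an Iceberg table and look at trends over time.
    history = spark.table('lakehouse.fleet.telemetry_history')
    trends = (events.join(history, 'vehicle_id')
                    .groupBy('vehicle_id')
                    .avg('engine_rpm', 'speed'))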

Low-latency serving: Bigtable for model serving of BigQuery Iceberg Managed Tables 

Apache Iceberg provides an efficient way to combine and manage large datasets for machine learning tasks. When you store your data in Iceberg tables, multiple engines can write to the same warehouse, and you can use Spark or BigQuery to train and evaluate ML models. Once you have a trained model, you often need to publish feature tables or feature vectors into a low-latency database for online application access.

Bigtable is well suited for low-latency applications that require lookups against these large-scale datasets. Let’s say you have a dataset of customer transactions stored across multiple Iceberg tables. You can use SparkSQL to combine this data and SparkML to train a fraud detection model on it. Once the model is trained, you can use it to predict the probability of fraud for new transactions. You can then write these predictions back to Bigtable using the Bigtable Spark connector, where they can be accessed by your fraud detection application.
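
Here is a minimal sketch of that flow. The Iceberg table names, the feature columns, the new_transactions DataFrame, and the predictions_catalog mapping are all hypothetical stand-ins for your own schema:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Combine transaction data from (hypothetical) Iceberg tables.
    transactions = spark.table('lakehouse.finance.transactions')
    customers = spark.table('lakehouse.finance.customers')
    training_data = transactions.join(customers, 'customer_id')

    # Train a simple fraud model; the feature columns are illustrative.
    assembler = VectorAssembler(inputCols=['amount', 'merchant_risk_score'], outputCol='features')
    model = LogisticRegression(labelCol='is_fraud').fit(assembler.transform(training_data))

    # Score unscored transactions and keep a primitive prediction column for serving.
    predictions = (model.transform(assembler.transform(new_transactions))
                        .select('transaction_id', 'prediction'))

    # Write the predictions back to Bigtable for low-latency access by the application.
    predictions.write \
        .format('bigtable') \
        .options(catalog=predictions_catalog) \
        .option('spark.bigtable.project.id', bigtable_project_id) \
        .option('spark.bigtable.instance.id', bigtable_instance_id) \
        .option('spark.bigtable.create.new.table', 'true') \
        .save()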

Use case: Vehicle telemetry using Bigtable and the Apache Spark connector

Let’s look at an abbreviated example of how Bigtable and the Apache Spark connector might work together for a company that tracks vehicle telemetry and wants to give its fleet managers immediate access to real-time KPIs for equipment effectiveness, while also allowing data scientists to build a predictive maintenance schedule that they can provide to drivers.

While this specific use case relies on vehicles as a case study, it is a generally applicable architecture pattern that can be used for a variety of telemetry and IoT use cases, ranging from measuring telecommunications equipment reliability to building KPIs for Overall Equipment Effectiveness (OEE) in a manufacturing operation.

[Architecture diagram: vehicle telemetry flowing into Bigtable, analyzed with Apache Spark and Apache Iceberg, and served back to user applications]

Let’s take a look at the various components of this architecture.

  1. Bigtable is an excellent choice for the high-throughput, low-latency writes that are often required for telemetry data, where vast amounts of data are continuously streamed in. Telemetry data schemas also change often, which calls for the flexible schema that Bigtable provides. Bigtable clusters can be deployed throughout the globe with different autoscaling configurations that match local write demand. The ingested data is automatically replicated to all clusters, giving you a single unified view of the data. There are also open-source streaming connectors for both Apache Kafka and Apache Flink, as well as industry-specific connectors such as NATS for automotive data. 
  2. Bigtable continuous materialized views offer real-time data transformations and aggregations on streaming data, enabling vehicle managers to gain immediate insights into their fleet’s activity and make data-driven adjustments.
  3. Keeping all data within Bigtable facilitates advanced analytics on historical information using Apache Spark. Data scientists can directly access this data in Apache Spark using the Bigtable Spark connector without needing to create copies. Furthermore, Bigtable Data Boost enables the execution of large batch or machine learning jobs, such as training predictive models or generating comprehensive reports, without impacting the performance of live applications. These jobs can involve joining streaming event data (e.g., real-time vehicle telemetry like GPS coordinates, speed, engine RPM, fuel consumption, or acceleration/braking patterns) with historical or static datasets stored in Apache Iceberg (e.g., vehicle master data including make, model, year, VIN, vehicle type, maintenance history, or driver assignments). Apache Iceberg may also include additional data sources such as weather and traffic analysis. This allows for richer insights, such as correlating specific driving behaviors with maintenance needs, predicting component failures based on operational data, or optimizing routes by combining real-time traffic with vehicle capacity and destination information. You can also provide analytics teams with secure Bigtable data access through Bigtable Authorized Views to limit data access to sensitive information like GPS. 
  4. Machine learning-driven insights, such as predictive maintenance recommendations that are often generated in batch processes and potentially stored in Iceberg tables, can be written back to Bigtable using the Bigtable Spark connector. This makes these valuable insights immediately accessible to user-facing applications. 
  5. Bigtable excels at the high-scale reads that this vehicle application’s user-facing features require, thanks to its distributed architecture and a design optimized for massive, time-series data. It can handle billions of rows and thousands of columns, and it retrieves data with low latency because it distributes data across many nodes and performs fast single-row lookups and efficient range scans, helping to ensure a smooth and responsive user experience even with millions of vehicles constantly streaming data. A point lookup from the serving application is sketched after this list.
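
As a rough illustration of that serving-side lookup with the Cloud Bigtable client library for Python, assuming a hypothetical maintenance_predictions table keyed by vehicle ID with a predictions column family:

    from google.cloud import bigtable

    # Connect to the (hypothetical) table that the Spark job wrote predictions into.
    client = bigtable.Client(project=bigtable_project_id)
    table = client.instance(bigtable_instance_id).table('maintenance_predictions')

    # Single-row lookup by vehicle ID; Bigtable serves this with low latency.
    row = table.read_row(b'vehicle#12345')
    if row is not None:
        cell = row.cells['predictions'][b'next_service_date'][0]
        print(cell.value.decode('utf-8'))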

Igniting the spark

The Bigtable Spark connector, combined with the recent connector enhancements for Apache Iceberg and Bigtable Data Boost, unlocks new possibilities for large-scale data processing on operational data. Whether you’re training ML models or performing serverless analytics, this powerful combination can help you implement new use cases and ease the operational burden of running complex ETL jobs. By leveraging the scalability, performance, and flexibility of these technologies, you can build robust and efficient data pipelines that can handle your most demanding workloads.

On Google Cloud, Dataproc Serverless simplifies running Apache Spark batch workloads by removing the need to manage clusters. When processing data via Bigtable’s serverless Data Boost, these jobs become highly cost-effective: you pay only for the processing power you consume, and only for the duration your workload runs, without needing to configure any compute infrastructure.

To get started, follow the Quickstart or learn more about Bigtable for your low-latency analytics workloads.