Delta Lake is an open-source, optimized storage layer that provides a foundation for tables in lakehouses and brings reliability and performance improvements to existing data lakes. It sits on top of your data lake storage (such as cloud object stores) and provides a performant and scalable metadata layer on top of data stored in the Parquet format.
Organizations use BigQuery to manage and analyze all data types, structured and unstructured, with fine-grained access controls. In the past year, customer use of BigQuery to process multiformat, multicloud, and multimodal data using BigLake has grown over 60x. Support for open table formats gives you the flexibility to use existing open source and legacy tools while getting the benefits of an integrated data platform. This is enabled via BigLake — a storage engine that allows you to store data in open file formats on cloud object stores such as Google Cloud Storage, and run Google-Cloud-native and open-source query engines on it in a secure, governed, and performant manner. BigLake unifies data warehouses and lakes by providing an advanced, uniform data governance model.
This week at Google Cloud Next ’24, we announced that this support now extends to the Delta Lake format, enabling you to query Delta Lake tables stored in Cloud Storage or Amazon Web Services S3 directly from BigQuery, without having to export, copy, or use manifest files to query the data.
Why is this important?
If you have existing dependencies on Delta Lake and prefer to continue using it, you can now leverage BigQuery native support. Google Cloud provides an integrated and price-performant experience for Delta Lake workloads, encompassing unified data management, centralized security, and robust governance. Many customers already harness the capabilities of Dataproc or Serverless Spark to manage Delta Lake tables on Cloud Storage. Now, BigQuery’s native Delta Lake support enables seamless delivery of data for downstream applications such as business intelligence, reporting, and integration with Vertex AI. This lets you do a number of things, including:
- Build a secure and governed lakehouse with BigLake’s fine-grained security model
- Securely exchange Delta Lake data using Analytics Hub
- Run data science workloads on Delta Lake using BigQuery ML and Vertex AI
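As a concrete illustration of the BigQuery ML point above, a model can be trained directly over a Delta Lake BigLake table with standard GoogleSQL. This is a minimal sketch; the model name and the `store_id`, `day_of_week`, and `sales` columns are hypothetical placeholders, not part of the announcement:

```sql
-- Hypothetical sketch: train a linear regression model in BigQuery ML
-- over a Delta Lake BigLake table (model and column names are illustrative).
CREATE OR REPLACE MODEL `PROJECT_ID.DATASET.sales_forecast_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['sales']
) AS
SELECT
  store_id,
  day_of_week,
  sales
FROM `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`;
```

Because the Delta Lake table behaves like any other BigLake table, no export or copy step is needed before training.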
How to use Delta Lake with BigQuery
Delta Lake tables follow the same table creation process as BigLake tables.
Required roles
To create a BigLake table, you need the following BigQuery identity and access management (IAM) permissions:
- bigquery.tables.create
- bigquery.connections.delegate
Prerequisites
Before you create a BigLake table, you need to have a dataset and a Cloud resource connection that can access Cloud Storage.
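If you don't already have a Cloud resource connection, one way to set it up is with the bq command-line tool and then grant the connection's service account read access to your bucket. This is a sketch with placeholder IDs; check the connection's actual service account before granting roles:

```
# Create a Cloud resource connection in the same region as your dataset.
bq mk --connection \
  --location=REGION \
  --project_id=PROJECT_ID \
  --connection_type=CLOUD_RESOURCE \
  CONNECTION_ID

# Inspect the connection to find its service account ID.
bq show --connection PROJECT_ID.REGION.CONNECTION_ID

# Grant that service account read access to the objects in Cloud Storage
# (SERVICE_ACCOUNT_ID is the value reported by the command above).
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:SERVICE_ACCOUNT_ID \
  --role=roles/storage.objectViewer
```

The connection is what lets BigQuery read the Delta Lake files on your behalf, so its service account, not your own identity, needs access to the storage bucket.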
Table creation using DDL
Here is the DDL statement to create a Delta Lake table:
```sql
CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  format = "DELTA_LAKE",
  uris = ['DELTA_TABLE_GCS_BASE_PATH']);
```
Querying Delta Lake tables
After creating a Delta Lake BigLake table, you can query it using GoogleSQL syntax, the same as you would a standard BigQuery table. For example:
```sql
SELECT FIELD1, FIELD2 FROM `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
```
You can also enforce fine-grained security at the table level, including row-level and column-level security. For Delta Lake tables based on Cloud Storage, you can also use dynamic data masking.
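For example, row-level security can be enforced with a row access policy on the table, just as for a standard BigQuery table. This is a minimal sketch; the policy name, group, and `region` column are hypothetical:

```sql
-- Hypothetical sketch: only members of the named group see US rows
-- (policy name, grantee, and the region column are illustrative).
CREATE ROW ACCESS POLICY us_rows_only
ON `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
GRANT TO ('group:us-analysts@example.com')
FILTER USING (region = 'US');
```

Queries from other users simply return no rows for the filtered data, with no changes needed in the underlying Delta Lake files.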
Conclusion
We believe that BigQuery’s support for Delta Lake is a major step forward for customers building lakehouses using Delta Lake. This integration will make it easier for you to get insights from your data and make data-driven decisions. We are excited to see how you use Delta Lake and BigQuery together to solve your business challenges. For more information on how to use Delta Lake with BigQuery, please refer to the documentation.
Acknowledgments: Mahesh Bogadi, Garrett Casto, Yuri Volobuev, Justin Levandoski, Gaurav Saxena, Manoj Gunti, Sami Akbay, Nic Smith and the rest of the BigQuery Engineering team.