AlloyDB for PostgreSQL under the hood: Business continuity

One of the most significant advantages of running databases in the cloud is the ability to ensure business continuity. In a database context, it’s common to define continuity as a combination of database availability and data durability.

AlloyDB is a fully-managed, PostgreSQL-compatible database for demanding transactional workloads. AlloyDB employs rigorous business continuity and data resilience practices by default. This means you can benefit from features like automatic failover, point-in-time recovery, and data replication without having to worry about complex and time-consuming configurations and management.

At Google Cloud, we understand the importance of business continuity, which is why we have designed our infrastructure and managed services to address many fault conditions that are most dreaded in self-managed deployments. By default, our architecture and infrastructure provide robust protection against data loss and unavailability events, which means that you can have peace of mind knowing that your data is always safe and accessible. At the same time, because different applications have different availability needs, we strive to provide flexibility and controls to you so that you can meet your own service level objectives.

In particular, AlloyDB is built on top of Google Cloud’s highly available and reliable infrastructure, which means it can provide high database availability and durability. With AlloyDB, you can ensure that your data is available and that you can recover from any disruptions quickly and efficiently.

Unavailability and data loss events

Mission-critical enterprise applications require databases that are always available and resilient, and data that is durable. Let’s dive deeper into these terms and understand which unavailability and data loss events they relate to.

Availability is the ability of end users or applications to access the database and perform data management operations. Application availability usually depends on database availability. When the database is down, users may not be able to log in, see your product catalog, place orders, or receive relevant offers.

Some of the most salient unavailability events are infrastructure failures. These range from minor events, such as a single hardware component failing, to broader ones, such as power outages or disasters that can take down a whole data center or a set of data centers in a region. Application errors can also render database systems unavailable. For example, a poorly written query can hold a lock on a whole table for extended periods, or a configuration mistake can make the database server unavailable.

Durability is the ability to retain data without corruption until it is retrieved and is an integral attribute of any database. Similar to unavailability events, certain infrastructure failures, like a disk failure, can also lead to data loss events. However, the most dreaded data loss events include user and application errors, such as accidental deletion of tables, or an application corrupting records before making them persistent in a database.

Resilience is the ability of a system to recover from failures it experiences and restore normal operations, and goes hand in hand with availability and durability. Highly available and highly durable systems are built by introducing redundancy and fault tolerance, thereby protecting the users against failure events.

Let’s see how Google technologies and the AlloyDB architecture allow you to achieve high levels of availability and durability.

Resilient by design

AlloyDB’s architecture leverages Google’s foundational storage system, Colossus. Colossus is one of the key building blocks of planet-scale, mission-critical services such as Google Search and Gmail. It is a distributed file system designed to be infinitely scalable and highly fault-tolerant. Two techniques underpin its fault tolerance: redundancy and erasure coding.

The redundancy technique used in Colossus involves splitting data into smaller chunks and copying them onto multiple servers in independent failure domains. This approach ensures that any copy that is impacted by an unavailability event, such as a network partition, or a data loss event, such as a disk failure, can be reconstructed from the remaining copies. Erasure coding is a technique that introduces additional data redundancy within the chunks, allowing regeneration of missing or corrupted data from the known blocks within the same chunk.
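
To make the idea of regenerating a missing block concrete, here is a minimal sketch of the simplest possible erasure code: a single XOR parity block. This is an illustration only. The encoding that Colossus actually uses is not described in this post, and the function names below are made up for the example.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(blocks: list[bytes]) -> bytes:
    """Compute one parity block over equal-length data blocks."""
    return reduce(xor_bytes, blocks)


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing data block from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


# Illustrative data only: three equal-length blocks of one chunk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)

# Simulate losing the middle block, then regenerate it from what is left.
recovered = rebuild_missing([data[0], data[2]], parity)
assert recovered == data[1]
```

A single parity block tolerates the loss of only one block. Production systems typically use stronger codes that survive several simultaneous losses, at the cost of more parity data.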

Beyond leveraging Colossus, AlloyDB’s intelligent storage service introduces additional fault tolerance. Data is stored redundantly in multiple zones for tolerance to zonal failures. Our storage also monitors outages in any of its zonal copies, providing the ability to reconstruct them to maintain its high availability. This means that you can be confident that your data will be safe and always available, even in the face of catastrophic failure.

AlloyDB also uses Google’s Virtual Machine (VM) designs, which support live migration of any of the underlying VMs in your AlloyDB clusters, including those that run your database instances. Live migration lets Google Cloud perform maintenance without interrupting a workload, rebooting a VM, or modifying any of the VM’s properties, such as IP addresses, metadata, block storage data, application state, and network settings. This allows AlloyDB to maintain its availability during infrastructure maintenance, security and configuration patching, and even hardware failures.

Add-on features to increase availability and durability

While AlloyDB’s building blocks discussed above already offer a high level of resilience, your mission critical applications can benefit from additional AlloyDB features that enhance availability and durability.

High availability instances

AlloyDB offers high availability instances with an industry-leading 99.99% availability SLA, inclusive of maintenance. Primary instances are highly available by default, and read pools with two or more nodes are also highly available.
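
To put the 99.99% figure in perspective, the short calculation below converts an availability percentage into a downtime budget. It is plain arithmetic and says nothing about how downtime is actually measured under the SLA terms; the function name is made up for the example.

```python
def downtime_budget_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime that an availability percentage allows over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


monthly = downtime_budget_minutes(99.99, 30)
yearly = downtime_budget_minutes(99.99, 365)
print(f"30-day budget: {monthly:.2f} min, 365-day budget: {yearly:.2f} min")
```

At 99.99%, the budget works out to roughly 4.32 minutes in a 30-day month, or about 52.56 minutes in a 365-day year.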

A key feature of AlloyDB’s high availability instances is automatic failover. AlloyDB automatically detects unhealthy PostgreSQL instances and fails over to standby machines in different zones within 60 seconds, independent of the database size and load. This ensures that your application experiences minimal disruption during unavailability events.
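
On the application side, a failover still shows up as a brief window of failed connections, so clients usually pair it with reconnect logic. The sketch below is one way to do that with exponential backoff and an overall deadline. It is not part of AlloyDB: connect stands in for whatever your database driver provides, and the 90-second default is an assumption chosen to leave headroom beyond the 60-second failover window.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    deadline_seconds: float = 90.0,  # assumption: headroom over a 60-second failover
    initial_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call connect() until it succeeds or the deadline would be exceeded."""
    start = clock()
    delay = initial_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            # Give up if the next wait would take us past the deadline.
            if clock() - start + delay > deadline_seconds:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Demo with a hypothetical driver call that fails three times, then succeeds.
attempts = {"count": 0}


def flaky_connect() -> str:
    attempts["count"] += 1
    if attempts["count"] < 4:
        raise ConnectionError("instance is failing over")
    return "connected"


waits: list[float] = []
result = connect_with_retry(flaky_connect, sleep=waits.append)  # record waits, do not sleep
print(result, waits)
```

The sleep and clock parameters exist only so the helper can be exercised without real waiting, as the demo does.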

AlloyDB’s high availability instances also support non-disruptive maintenance, both for user-driven operations, such as instance resizing and reconfigurations that require restarts, and for service-driven operations, like minor PostgreSQL version upgrades. In preparation for these maintenance events, we prepare a new instance and warm up its caches before replacing the current instance with it. Maintenance events are completed with less than 10 seconds of disruption on the primary instance, while the read pools remain fully operational during maintenance.

Under the hood, AlloyDB’s high availability instances employ the same battle-tested Google Cloud services that we offer to our customers for building highly available applications. Regional Internal Load Balancers route connections seamlessly between machines, ensuring that your application is always connected to a healthy instance. Managed Instance Groups instantiate replacement machines rapidly, and Spanner persists instance configurations, making maintenance and scale-out operations fast and reliable.

Cross-region replication

Although simultaneous failures of multiple data centers in the same cloud region are vanishingly rare (typically requiring disastrous events such as fires or earthquakes), AlloyDB adheres to the principle of planning for all possible failure scenarios, including rare but large-scale events.

Cross-region replication allows you to replicate your data asynchronously from your primary cluster into a secondary cluster in another cloud region. In case of a primary cluster failure or outage, you can quickly promote the secondary cluster to become the new primary cluster, allowing your application to continue running with minimal disruption.

The cross-region replication feature operates asynchronously, which means that it doesn’t impact the write performance of your primary cluster. The secondary cluster receives updates in near-real-time, and you can monitor the lag.
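
Because replication is asynchronous, the lag at the moment of promotion is roughly the amount of recent work that may not have reached the secondary. A minimal sketch of checking that lag against a recovery point objective (RPO) follows. The timestamps are hypothetical inputs, and how you obtain them from your monitoring is outside the scope of this post.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(primary_commit: datetime, secondary_replay: datetime) -> timedelta:
    """Gap between the newest commit on the primary and the newest one replayed remotely."""
    return max(primary_commit - secondary_replay, timedelta(0))


def within_rpo(lag: timedelta, rpo: timedelta) -> bool:
    """True if promoting the secondary now would lose no more than the RPO allows."""
    return lag <= rpo


# Hypothetical sample: the secondary is three seconds behind the primary.
primary_commit = datetime(2023, 1, 1, 12, 0, 5, tzinfo=timezone.utc)
secondary_replay = datetime(2023, 1, 1, 12, 0, 2, tzinfo=timezone.utc)

lag = replication_lag(primary_commit, secondary_replay)
print(lag.total_seconds(), within_rpo(lag, rpo=timedelta(seconds=30)))
```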

Backups and point-in-time recovery

Data loss can be a major setback for businesses, which is why AlloyDB offers robust backup and point-in-time recovery capabilities to protect your valuable data. You can take manual, automatic or continuous backups of your AlloyDB clusters to protect your data against data loss events.

Backups are offloaded to AlloyDB’s aforementioned intelligent storage service, which allows backup operations to proceed without any impact on the read or write performance of instances attached to the cluster. The sharded nature of the storage system allows backups to be taken and restored in parallel, which speeds up backup and recovery times significantly.

Backups are stored in Cloud Storage, separate from the cluster storage, and enjoy some of the highest durability in enterprise storage systems, with 99.999999999% (11 9s) annual durability. This means that a typical database with daily backups may run for hundreds of thousands of years without experiencing corruption – so if your backups start exhibiting problems just after 10,000 years, give us a call! Backups are also independent from cluster lifecycle, offering protection against accidental cluster deletions.
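
For readers who want to see where numbers of this magnitude come from, here is a back-of-the-envelope sketch. It assumes that the 11 9s figure applies independently to each stored object, which is a simplification, and the object count is a made-up illustration rather than a property of any real backup.

```python
def expected_annual_losses(object_count: int, nines: int = 11) -> float:
    """Expected number of objects lost per year at the given annual durability."""
    annual_loss_probability = 10.0 ** -nines
    return object_count * annual_loss_probability


def mean_years_between_losses(object_count: int, nines: int = 11) -> float:
    """Average number of years before a single object is expected to be lost."""
    return 1.0 / expected_annual_losses(object_count, nines)


# Assumption for illustration: backups occupy one million stored objects.
years = mean_years_between_losses(object_count=1_000_000)
print(f"about {years:,.0f} years between expected losses")
```

Under these assumptions, a million stored objects would see an expected loss about once every 100,000 years. The result scales inversely with the number of objects.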

AlloyDB offers automated backups for convenient scheduling and retention of backups. Rich customization parameters allow you to tailor this feature to your organization’s specific needs: you can take backups as frequently as every hour or as seldom as once a week, retain them for a time period of up to a year or by count, and encrypt them with a customer-managed encryption key (CMEK) or Google’s default encryption.
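
The limits above can be captured in a small validation model, which is handy if you keep backup settings in your own configuration and want to catch mistakes early. The class below is hypothetical and is not an AlloyDB API type. It also assumes that "up to a year" means 365 days and that a policy uses either time-based or count-based retention, not both.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class BackupPolicy:
    """Hypothetical model of an automated backup schedule (not an AlloyDB type)."""

    interval: timedelta
    retention_period: Optional[timedelta] = None
    retention_count: Optional[int] = None

    def __post_init__(self) -> None:
        if not timedelta(hours=1) <= self.interval <= timedelta(weeks=1):
            raise ValueError("interval must be between one hour and one week")
        if (self.retention_period is None) == (self.retention_count is None):
            raise ValueError("set exactly one of retention_period or retention_count")
        if self.retention_period is not None and not (
            timedelta(0) < self.retention_period <= timedelta(days=365)
        ):
            raise ValueError("retention_period must be positive and at most one year")
        if self.retention_count is not None and self.retention_count < 1:
            raise ValueError("retention_count must be at least 1")

    def max_backups_retained(self) -> int:
        """Upper bound on the number of backups held at once under this policy."""
        if self.retention_count is not None:
            return self.retention_count
        return self.retention_period // self.interval


daily_for_two_weeks = BackupPolicy(
    interval=timedelta(days=1), retention_period=timedelta(days=14)
)
print(daily_for_two_weeks.max_backups_retained())
```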

Earlier this month, we announced the general availability of continuous backup and recovery for AlloyDB, which provides point-in-time recovery to any moment within a recovery window of up to 35 days. Use cases include restoring to a point just prior to an error that caused data corruption, or to a specific date and time for auditing or for test and development purposes. Continuous backup and recovery combines a daily backup plan with continuous saving of transaction logs. By default, all primary clusters have a continuous backup plan with a 14-day recovery window to ensure your data is protected.
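
As a quick illustration of what a recovery window means in practice, the sketch below checks whether a target timestamp is still recoverable. The 14-day default and 35-day maximum come from the description above; the function itself is hypothetical and ignores service-side details, such as a window that was enabled or extended only recently.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RECOVERY_WINDOW = timedelta(days=14)
MAX_RECOVERY_WINDOW = timedelta(days=35)


def can_recover_to(
    target: datetime,
    now: datetime,
    window: timedelta = DEFAULT_RECOVERY_WINDOW,
) -> bool:
    """True if target falls inside the recovery window that ends at now."""
    if not timedelta(0) < window <= MAX_RECOVERY_WINDOW:
        raise ValueError("recovery window must be positive and at most 35 days")
    return now - window <= target <= now


# Hypothetical timestamps for illustration.
now = datetime(2023, 6, 30, 12, 0, tzinfo=timezone.utc)
ten_days_ago = now - timedelta(days=10)
twenty_days_ago = now - timedelta(days=20)

print(can_recover_to(ten_days_ago, now))  # inside the default 14-day window
print(can_recover_to(twenty_days_ago, now))  # outside it
print(can_recover_to(twenty_days_ago, now, window=timedelta(days=35)))
```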

Similar to backups, continuous backup and recovery is also managed by AlloyDB’s intelligent storage system, which means it doesn’t impact instance performance and achieves faster point-in-time recovery compared to traditional point-in-time recovery of PostgreSQL (due to its ability to parallelize replay). Transaction logs are offloaded to Cloud Storage, not impacting your database’s cluster storage.

Continuous backup and recovery is priced by the storage used by the transaction logs and backups. The first seven days of log storage are provided at no additional cost.

Build highly resilient applications with AlloyDB

We understand how crucial high availability and data durability are for your business. Google Cloud’s infrastructure and managed services provide a strong foundation for ensuring business continuity, and AlloyDB builds upon this foundation with its intelligent storage service and additional features to increase availability and durability. By leveraging these features, you can have peace of mind knowing that your mission-critical applications and data should be available, even in the face of infrastructure failures, data loss events, and other disruptions.

Build your application today on AlloyDB. New customers can try AlloyDB for free.