Preventing Lost Paychecks: Lessons from the Wells Fargo Datacenter Failure
Recently, Wells Fargo was in the news because of a datacenter outage. Various reports suggested that the facility suffered a power interruption and fires. The outage impacted debit card transactions and online banking. However, that was just day one. On day two, customers began reporting missing paychecks: direct deposits were not showing up in their accounts.
If you are a bank, this is the scenario you meticulously plan to prevent, because the damage to your reputation is far more costly than the damage caused by any individual event. You simply cannot lose your customers’ hard-earned money!
If you’re wondering how a bank in this day and age can still be vulnerable to multi-day outages and, more importantly, how it can lose data, we think the answer may lie in what is often the most overlooked component of IT infrastructure: the database.
It is the database behind the banking system that holds your paycheck information. Once written to the database, that information should never be lost, no matter what the disruption. But as this news suggests, today’s banking systems remain vulnerable to data loss, and we blame the database.
Databases are at Odds with Resiliency
The industry as a whole has been making strides to build resilient systems. Private datacenters and public cloud providers with multiple availability zones offer the option to create multi-node (or distributed) deployments. Application architectures have evolved at an even faster pace to take advantage of these distributed options. In theory, should a zone erupt in fire, your systems should keep working. And they mostly do, until you hit the database.
Unfortunately, most modern databases still make resilience extremely hard to achieve. Legacy transactional databases often employ a centralized architecture. Failover systems are expensive to implement (often requiring specialized hardware and networks) and, even then, come with many lines of fine print. They are also hard to configure and maintain, even within the same datacenter, much less in multi-datacenter scenarios. Moreover, they simply weren’t designed for the cloud. If you’re a bank adopting the cloud, you’re out of luck: you’re still bottlenecked on the database.
Replatforming Your Information Systems
New global transactional databases like FaunaDB and Google Spanner offer hope for building new systems that are more resilient to disasters. They are designed to operate as globally distributed clusters while offering the same data consistency and isolation guarantees as Oracle and other relational databases. Here are some capabilities to look for in your new database.
Effortless Highly Available Architecture
Look for databases that are built from the ground up for distributed operation. Whether you host the database in your own datacenters, leverage public clouds, or use some hybrid of the two, choose a database that was designed to make dispersing nodes easy. No matter how you partition your compute resources, and no matter your virtualization or container strategy, your database must work with it effortlessly. It should do so without requiring manual work to replicate or partition data, extra hardware or software, or excessive hours spent on configuration.
Multi-region ACID Transactions
Watch out for distributed databases that offer “single-region” or “multi-document” ACID. Constraining ACID in any form is not conforming to ACID at all. You’re back in the same scenario—a regional failure could cause data loss.
To ensure consistency and durability, it isn’t sufficient that your database merely support the ACID primitives of an RDBMS. More importantly, your database must offer these capabilities in highly partitioned environments, so that writes are committed across the entire cluster in real time. Such “multi-region” ACID capabilities mean that your applications do not have to be built with the deployment topology in mind.
No matter how you deploy your database nodes, your transactions should just work. Should portions of your infrastructure fail, the database as a whole should continue to offer the same ACID guarantees.
If you plan to leverage public cloud options, relying on a multi-vendor strategy greatly diminishes your disaster risk. Probabilistically speaking, there is a much lower chance that both or all three of your cloud vendors will fail across all their availability zones simultaneously. So pick a database that gives you the option to operate across multiple clouds without any retrofitting of your application or infrastructure. Specifically, ensure that your database is not sensitive to differences in underlying hardware and network characteristics, or to clock skew.
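To make the probabilistic argument concrete, here is a minimal sketch of the arithmetic. It assumes that provider outages are statistically independent, which is a strong simplification, since shared dependencies such as DNS or network backbones can correlate failures. The function name and the outage probabilities are hypothetical and chosen only for illustration.

```python
import math


def joint_outage_probability(outage_probabilities):
    """Return the probability that every provider is down at the same time.

    Assumes outages are independent, so the joint probability is simply
    the product of the individual probabilities.
    """
    if not outage_probabilities:
        raise ValueError("at least one provider is required")
    for p in outage_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return math.prod(outage_probabilities)


# Hypothetical figure: each provider is fully unavailable 0.1% of the time.
p = 0.001
for providers in (1, 2, 3):
    joint = joint_outage_probability([p] * providers)
    print(f"{providers} provider(s): {joint:.0e}")
```

Under these assumed numbers, each additional independent provider lowers the chance of a total outage by three orders of magnitude. Real-world figures will be less favorable to the extent that failures are correlated.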
Resistance to Chaos
All environments have inherent chaos, and moving to a distributed environment increases your exposure to it. Individual nodes can go down, daily devops procedures can shut down or restart your VMs, and other minor disruptions and intermittent noise in the network all have the potential to corrupt the state of your database. As long as a quorum of nodes in the database cluster is operational, the cluster must continue to operate without impacting your applications. Your database must be resilient to chaos.
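The quorum condition can be sketched in a few lines. This example assumes a simple majority quorum, which is a common choice in consensus-based systems. The article does not say which quorum rule any particular database uses, so the function names and the rule itself are illustrative assumptions.

```python
def majority_quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority (assumed rule)."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True if enough nodes are healthy for the cluster to keep operating."""
    return healthy_nodes >= majority_quorum(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail before a majority quorum is lost."""
    return cluster_size - majority_quorum(cluster_size)


# A 5-node cluster keeps its quorum with 3 healthy nodes but not with 2.
print(has_quorum(5, 3), has_quorum(5, 2))
print(tolerated_failures(3), tolerated_failures(4), tolerated_failures(5))
```

One consequence of the majority rule is that a 4-node cluster tolerates no more failures than a 3-node cluster, which is why odd cluster sizes are usually preferred.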
Easy Distributed Backup & Recovery
The administrative overhead of operating distributed systems often becomes a barrier to adopting a distributed approach to your data. Therefore, one requirement for adopting a distributed database is that node backups are easy to make and that node recovery time is within acceptable bounds. More importantly, your database should allow for easy automation of daily tasks using your existing devops frameworks.
FaunaDB Enables Resilience
FaunaDB is a distributed operational database built to solve the challenges of modern-day digital businesses. Inspired by the Calvin protocol, FaunaDB is a CP database that is also highly available. It is easy to deploy, easy to operate, and supports multi-cloud environments. You can also use FaunaDB Serverless Cloud, the world’s first multi-cloud serverless database, with nodes on both AWS and GCP.
Some of the salient points of FaunaDB’s architecture include:
- Distributed, Multi-Cloud: FaunaDB is architected to achieve global distribution without complexity. Because it does not depend on the underlying infrastructure to execute distributed transactions, FaunaDB can operate in any combination of datacenter and cloud environments to suit your business objectives. You can read more about its patent-pending, platform-agnostic transaction protocol in this article.
- 100% ACID & Strict Serializability: FaunaDB offers multi-region ACID transactions and provides the strongest levels of transaction isolation in its class. Once committed, data is never lost. Be they paycheck deposits or hotel reservations, records are durable irrespective of your partitioning strategy. Concurrent reads and writes adhere to the order in which they were accepted, even under chaos conditions such as node failure or clock skew. This article by Dr. Daniel Abadi compares Calvin to Spanner and its derivatives.
- High Availability & Scalability: FaunaDB scales horizontally on commodity hardware. Deploy a new node, and go. There are no extra hardware or network parameters to tweak. Check out this demo to see FaunaDB transactions under a datacenter failure.
- Effortless Ops: We made it simple to roll out FaunaDB. We use transactional cluster configuration commands to ensure that you have to perform minimal manual housekeeping. The database should not be an impediment to operations. It should blend in and just work. You have your business to worry about. With each release, our objective is to handle operational automation within the database itself. Here’s how we think of operational simplicity for the modern distributed datacenter.
What Can We Learn from Wells?
If you’re a bank or a financial services company, or any business that processes financial transactions (payments, billing, reservations, ecommerce, etc.), upgrading your data plane to support your disaster scenarios is no longer a luxury. To thrive in a digital economy, and to compete in the digital business era, you must transform your systems to withstand unpredictable situations. While downtime may be tolerable, data loss is possibly the worst outcome. You lose not only revenue, but also your reputation.
If you enjoyed this topic and want to work on systems and challenges just like this, Fauna is hiring!