Verifying Transactional Consistency with Jepsen
I’d like to give an update on our efforts to verify FaunaDB’s transactional correctness guarantees.
FaunaDB is a mission critical NoSQL database. Most database technologies sound like a joke version of the "good, fast, or cheap" mantra: usable, scalable, or correct, choose two, but more often just choose one, and sometimes you get zero. Instead, FaunaDB endeavors to be all three at once. Today we will talk about becoming correct.
We've worked hard on the transaction engine (a lockless, single-phase distributed preprocessor inspired by Calvin), but we've worked just as hard on the testing infrastructure. As other NoSQL systems like Cassandra and MongoDB found out, it's a lot easier to aspire to operational integrity or scalability than actually have it.
Testing at Fauna
As part of our engineering process at Fauna, we have built a comprehensive suite of unit and property-based tests and as well as a sophisticated distributed testing framework that is capable of checking the behavior of FaunaDB in the presence of a wide combination of fault injections and operational changes, similar to Netflix's Chaos Monkey.
Kyle Kingsbury’s Jepsen has quickly earned a reputation as the industry standard for distributed systems testing. We have now implemented and verified two tests in Jepsen:
- A register test, which checks linearizability over individual records.
- A ledger test, which checks serializability of transactions involving multiple records.
FaunaDB passes these tests.
A Different Kind of Halting Problem
Jepsen started with a simple question: what if the guy mowing the lawn never calls? Or more technically speaking: what happens to production systems when networks fail?
Up until this point, correctness in most distributed systems was purely aspirational, potentially excepting Tandem NonStop. It turned out that essentially no systems did what they said on the tin, due to both implementation defects and because they made claims that physics and information science unfortunately do not allow.
Essentially no systems did what they said on the tin, because they made claims that physics unfortunately does not allow.
Jepsen has since become capable of simulating a much larger variety of machine and network failure scenarios, and of checking a system’s more formal isolation characteristics, such as its ability to maintain sequential consistency, snapshot isolation, linearizability, and liveness in the presence of these failures.
Research has shown that the randomized testing methodology that Jepsen represents is very effective at exposing faults related to partition tolerance. We can have a high degree of confidence in Fauna's distributed correctness if it successfully passes a correctly-implemented set of tests based on randomized failure simulation.
Future ways for us increase our correctness confidence include formal specification and verification like TLA+, dependent typing, etc.
Read the Results
Available now is a report of our internal Jepsen results written by our own engineering team. So far, we have found no significant flaws in our implementation. FaunaDB is capable of preserving strong ACID guarantees in the face of concurrent operations, cluster reconfigurations, and network partitions. And as a CP system, FaunaDB maintains liveness as long as a majority of the cluster remains available.
FaunaDB is capable of preserving strong ACID guarantees in the face of concurrent operations, cluster reconfigurations, and network partitions.
Stay tuned for an independent analysis and an open-source release of the Fauna Jepsen tests in the future. In the meantime, have a look at the report.
Thanks to Brandon Mitchell, Nathan Taylor, Jeff Smick, Attila Szegedi, and the whole Fauna engineering team for their hard work on the Jepsen analysis. Evan Weaver and Dhruv Gupta contributed to this post.
If you enjoyed this topic and want to work on systems and challenges just like this, Fauna is hiring!