I’d like to give an update on our efforts to verify Fauna’s transactional correctness guarantees.
Fauna is a mission critical NoSQL database
. Most database technologies sound like a joke version of the "good, fast, or cheap" mantra: usable, scalable, or correct, choose two, but more often just choose one, and sometimes you get zero. Instead, Fauna endeavors to be all three at once. Today we will talk about becoming correct.
We've worked hard on the transaction engine (a lockless, single-phase distributed preprocessor inspired by Calvin
), but we've worked just as hard on the testing infrastructure. As other NoSQL systems like Cassandra and MongoDB found out, it's a lot easier to aspire to operational integrity or scalability than actually have it.
Testing at Fauna
As part of our engineering process at Fauna, we have built a comprehensive suite of unit and property-based tests and as well as a sophisticated distributed testing framework that is capable of checking the behavior of Fauna in the presence of a wide combination of fault injections and operational changes, similar to Netflix's Chaos Monkey
Kyle Kingsbury’s Jepsen
has quickly earned a reputation as the industry standard for distributed systems testing. We have now implemented and verified two tests in Jepsen:
- A register test, which checks linearizability over individual records.
- A ledger test, which checks serializability of transactions involving multiple records.
Fauna passes these tests.
A Different Kind of Halting Problem
Up until this point, correctness in most distributed systems was purely aspirational, potentially excepting Tandem NonStop. It turned out that essentially no systems did what they said on the tin, due to both implementation defects and because they made claims that physics and information science unfortunately do not allow.
Essentially no systems did what they said on the tin, because they made claims that physics unfortunately does not allow.
Jepsen has since become capable of simulating a much larger variety of machine and network failure scenarios, and of checking a system’s more formal isolation characteristics, such as its ability to maintain sequential consistency, snapshot isolation, linearizability, and liveness in the presence of these failures.
Research has shown
that the randomized testing methodology that Jepsen represents is very effective at exposing faults related to partition tolerance. We can have a high degree of confidence in Fauna's distributed correctness if it successfully passes a correctly-implemented set of tests based on randomized failure simulation.
Future ways for us increase our correctness confidence include formal specification and verification like TLA+
, dependent typing, etc.
Read the Results
Available now is a report of our internal Jepsen results
written by our own engineering team. So far, we have found no significant flaws in our implementation. Fauna is capable of preserving strong ACID guarantees in the face of concurrent operations, cluster reconfigurations, and network partitions. And as a CP system, Fauna maintains liveness as long as a majority of the cluster remains available.
Fauna is capable of preserving strong ACID guarantees in the face of concurrent operations, cluster reconfigurations, and network partitions.
Stay tuned for an independent analysis and an open-source release of the Fauna Jepsen tests in the future. In the meantime, have a look at the report
Thanks to Brandon Mitchell, Nathan Taylor, Jeff Smick, Attila Szegedi, and the whole Fauna engineering team for their hard work on the Jepsen analysis. Evan Weaver and Dhruv Gupta contributed to this post.