Fauna implements a unique distributed transaction engine (DTE) that delivers a strongly consistent database across geographically dispersed regions. It allows for distributed reads and writes with low latency while being inherently highly available with no single points of failure. This article looks under the hood to outline the most significant components of the Fauna database service. It also illustrates why Fauna is uniquely positioned to support the most demanding transactional applications that must be distributed and performant, providing near real-time responsiveness for customers worldwide.
A flexible, layered system
Fauna separates database services into logical layers, represented by nodes in a deployment topology. Every node knows the entire topology and can forward requests to other regions if a local node fails to respond. Fauna’s distributed architecture is based on the Calvin transaction protocol, extended with additional innovations. The resulting protocol provides transaction scheduling and data replication layers that use a deterministic ordering guarantee to reduce the contention costs associated with distributed transactions. It guarantees that every replica sees the same log of transactions, reaches a final state equivalent to executing the transactions in that log sequentially, and therefore reaches a final state identical to every other replica’s. Fauna adds an optimized data storage layer and various other components, battle-tested by more than 100,000 databases over the years.
Fig. 1. Fauna’s distributed transaction engine (DTE) Layers.
As client requests come into the system, the service leverages a highly available DNS service with latency-based forwarding to route each request to nodes in the nearest zone or region. A request is handled at ingress by a controller that can defensively throttle bursts when traffic exceeds rate limits or provisioned capacity. A service mesh component maps and caches incoming database keys to the region where the database is located. Developers use a single endpoint for their requests, regardless of where their data resides, so applications do not need to find the ideal replica to query.
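As a rough sketch of this routing flow (the region names, latencies, and key cache here are illustrative assumptions, not Fauna's actual implementation), a request enters at the lowest-latency region and the service mesh resolves the database's home region from a cached key mapping:

```python
# Hypothetical sketch of latency-based ingress routing. All names and
# values are illustrative; this is not Fauna's actual routing logic.

REGION_LATENCIES_MS = {"us-east": 12, "eu-west": 85, "ap-south": 140}

# Service-mesh cache mapping database keys to home regions (illustrative).
KEY_REGION_CACHE = {"db-key-123": "eu-west"}

def route_request(database_key: str) -> str:
    """Enter at the lowest-latency region, then resolve the database's
    home region; unknown keys are served from the entry region."""
    entry_region = min(REGION_LATENCIES_MS, key=REGION_LATENCIES_MS.get)
    return KEY_REGION_CACHE.get(database_key, entry_region)

print(route_request("db-key-123"))  # eu-west
```

The point of the single endpoint is visible here: the caller never chooses a region; the service resolves it from the database key.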
Query coordination layer
Query coordinators are stateless, horizontally scalable nodes that perform a pre-computation step, computing the inputs and effects of each transaction in advance. This is a vital characteristic of the Calvin protocol, removing the need for pre-transaction locks or write intents. The coordinator selects a snapshot time, inspects the transaction associated with the request, and optimistically executes it without committing writes. The resulting flattened set of reads is checked for contention; if the transaction includes no writes, the query results are returned directly to the requesting client. If a write needs to be committed, the transaction is passed to the transaction logging nodes.
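The pre-computation step can be sketched as follows. This is a minimal illustration under assumed names (the `precompute` function and its read/write callbacks are hypothetical, not Fauna's API): the transaction runs against an immutable snapshot, producing its read and write sets without applying anything.

```python
# Minimal sketch of the coordinator's optimistic pre-execution step.
# Function names are assumptions for illustration, not Fauna's API.

def precompute(txn, snapshot):
    """Run txn against an immutable snapshot, recording reads and
    buffering writes; nothing is committed here."""
    reads, writes = {}, {}

    def read(key):
        reads[key] = snapshot.get(key)
        return reads[key]

    def write(key, value):
        writes[key] = value  # buffered only, never applied

    txn(read, write)
    return reads, writes

snapshot = {"balance": 100}

def txn(read, write):
    if read("balance") >= 50:
        write("balance", 50)

reads, writes = precompute(txn, snapshot)
print(reads, writes)  # {'balance': 100} {'balance': 50}
```

A read-only transaction would yield an empty write set and return immediately; a non-empty write set is what gets forwarded, along with the read set, to the logging layer.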
Transaction logging layer
As the name suggests, the transaction logging layer acts as a global write-ahead log split into segments. Each segment implements an optimized Raft consensus algorithm, which provides redundancy and geographic distribution across replicas. A segment leader node is elected, and the remaining nodes forward transactions to it. The leader builds transaction batches on an epoch interval and communicates with the other segment leaders to agree on the full set of transactions in the batch. Once the batch has been replicated, its transactions are considered optimistically committed, although their write effects still need to be applied. When all segments have committed their batches, the epoch’s transactions are sent to the data storage layer.
Unlike other distributed database services, Fauna does not need real-time global clock synchronization to guarantee correctness. Segment leader log nodes are the only ones generating epochs, minimizing cross-replica coordination. Log nodes' clocks are synchronized via NTP, ensuring that epochs are generated at about the same time. A timestamp is applied to every transaction reflecting its real commit time (within milliseconds of real time) and its logical, strictly serializable order in relation to other transactions. To scale and optimize the throughput of the global log, epoch intervals can be configured, and the number of segments can be increased.
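The key consequence of epoch batching is that ordering is logical, not clock-driven. As an illustrative sketch (the data layout is an assumption, not Fauna's wire format), every replica can derive the same total order by sorting on (epoch, segment, position):

```python
# Illustrative sketch of deterministic epoch ordering. The nested-dict
# layout is assumed for illustration, not Fauna's actual log format.

def global_order(epoch_batches):
    """epoch_batches: {epoch: {segment: [txn_ids]}}.
    Returns a single total order that every replica derives identically,
    with no global clock required."""
    ordered = []
    for epoch in sorted(epoch_batches):
        for segment in sorted(epoch_batches[epoch]):
            ordered.extend(epoch_batches[epoch][segment])
    return ordered

batches = {
    2: {"seg-a": ["t4"], "seg-b": ["t5"]},
    1: {"seg-a": ["t1", "t2"], "seg-b": ["t3"]},
}
print(global_order(batches))  # ['t1', 't2', 't3', 't4', 't5']
```

Because the order is a pure function of the replicated log, NTP-level clock skew affects only how close epoch timestamps are to wall-clock time, never correctness.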
Data storage layer
All log nodes have a persistent connection to data storage nodes in each replica. Data storage nodes are assigned ranges of keys for which they are responsible, with the complete keyspace covered. All data is stored in each zone or region, with every document redundantly stored on at least three nodes. Each storage node listens for newly committed transactions in the key ranges it covers and validates that no values read during transaction execution have changed between the snapshot time and the final commit time. Nodes can communicate with peers if they need the state of values they do not cover.
If the read values do not conflict, the storage node updates the values it covers, and the query coordinator is notified of the transaction’s success. If conflicts arise, the storage node drops the writes and informs the coordinator of the transaction failure. Since the validation checks are deterministic, either all nodes apply the transaction or all nodes fail it. Once the commit succeeds on at least one storage node, the result is sent to the client.
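This validate-then-apply check can be sketched in a few lines (the function name and in-memory store are assumptions for illustration): buffered writes are applied only if every value the transaction read is still unchanged, and because the check is a pure function of the read set and the store, every replica reaches the same verdict.

```python
# Sketch of the storage layer's deterministic validation check.
# Names and the dict-backed store are illustrative assumptions.

def apply_if_valid(store, reads, writes):
    """Apply buffered writes only if every value read at snapshot time
    is unchanged at commit time. Returns True on success."""
    for key, value_at_snapshot in reads.items():
        if store.get(key) != value_at_snapshot:
            return False  # conflict: drop the writes
    store.update(writes)
    return True

store = {"balance": 100}
ok = apply_if_valid(store, reads={"balance": 100}, writes={"balance": 50})
print(ok, store)  # True {'balance': 50}

# A concurrent transaction that read the now-stale value fails:
ok2 = apply_if_valid(store, reads={"balance": 100}, writes={"balance": 0})
print(ok2, store)  # False {'balance': 50}
```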
Documents are never overwritten; each change inserts a new version into the document history as a create, update, or delete event. Version retention policies are configurable per database. This enables temporal queries, so any transaction can be executed consistently at a point in the past. This is useful for auditing, rollback, cache coherency, and synchronization with other systems, and it gives data administrators a way to fix inconsistencies. It also facilitates event streaming, where external systems can be notified in near-real-time when documents or collections change.
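A versioned history like this is simple to picture. The sketch below is a hypothetical in-memory model (the `VersionedDoc` class and its methods are illustrative, not FQL or Fauna's storage format): every write appends an event, and a temporal read resolves the latest version at or before the requested timestamp.

```python
# Hypothetical model of append-only document history with temporal reads.
# Class and method names are illustrative, not Fauna's API.
import bisect

class VersionedDoc:
    def __init__(self):
        self.events = []  # (ts, action, value), appended in ts order

    def record(self, ts, action, value=None):
        self.events.append((ts, action, value))

    def at(self, ts):
        """Return the document as of time ts, or None if it did not
        exist (or was deleted) at that time."""
        timestamps = [e[0] for e in self.events]
        i = bisect.bisect_right(timestamps, ts)
        if i == 0:
            return None
        _, action, value = self.events[i - 1]
        return None if action == "delete" else value

doc = VersionedDoc()
doc.record(1, "create", {"status": "draft"})
doc.record(5, "update", {"status": "published"})
doc.record(9, "delete")
print(doc.at(3))   # {'status': 'draft'}
print(doc.at(7))   # {'status': 'published'}
print(doc.at(10))  # None
```

Event streaming falls out of the same structure: consumers simply follow the append-only event list forward in time.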
Critical background tasks like index builds often cause availability issues in other systems. To mitigate this, Fauna manages background work with a journaled, topology-aware scheduler similar to Hadoop YARN. The task system leverages the DTE for strongly consistent storage of task state and for coordination between nodes. If a node fails, its tasks are automatically reassigned to eligible healthy nodes.
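The reassignment behavior can be sketched as follows (the data model and round-robin policy are assumptions for illustration; Fauna's scheduler is more sophisticated): tasks on failed nodes are redistributed across the remaining healthy nodes.

```python
# Simplified sketch of task reassignment on node failure. The mapping
# layout and round-robin policy are illustrative assumptions.

def reassign(assignments, healthy_nodes):
    """assignments: {task: node}. Returns a new mapping in which tasks
    on failed nodes are redistributed round-robin across healthy nodes."""
    healthy = sorted(healthy_nodes)
    result, i = {}, 0
    for task in sorted(assignments):
        node = assignments[task]
        if node in healthy_nodes:
            result[task] = node  # node is alive: task stays put
        else:
            result[task] = healthy[i % len(healthy)]
            i += 1
    return result

tasks = {"index-build": "node-a", "compaction": "node-b"}
# node-a has failed; node-b and node-c remain healthy:
print(reassign(tasks, {"node-b", "node-c"}))
```

In the real system the task state behind this mapping lives in the DTE itself, so the reassignment decision is as consistent as any other transaction.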
Files are stored in logical levels, and index structures are kept in memory to reduce the need to seek through each level file to find a data item. Files on disk are sorted string tables, LZ4-compressed and compacted when a level grows past a fixed size. Compaction tasks perform an incremental merge-sort of the contents of a batch of level files, evict expired and deleted data, and emit a new combined file. These techniques reduce I/O and disk usage to maintain predictable, high-performance reads.
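The compaction step can be sketched as a merge-sort over sorted tables (the list-of-tuples layout and tombstone convention below are illustrative assumptions, not Fauna's on-disk format): the newest version of each key wins, and deleted keys are evicted from the combined output.

```python
# Illustrative sketch of level-file compaction. The table layout and
# tombstone sentinel are assumptions, not Fauna's storage format.
import heapq

TOMBSTONE = None  # sentinel marking a deleted key

def compact(level_files):
    """level_files: list of sorted (key, value) lists, newest first.
    Merge-sorts them into one sorted table; the newest version of each
    key wins, and tombstoned keys are evicted."""
    # Tag entries with their file's age so ties sort newest-first.
    tagged = ([(key, age, value) for key, value in table]
              for age, table in enumerate(level_files))
    merged, seen = [], set()
    for key, _age, value in heapq.merge(*tagged):
        if key in seen:
            continue  # an older version of a key already emitted
        seen.add(key)
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged

newest = [("a", 3), ("b", TOMBSTONE)]
oldest = [("a", 1), ("b", 2), ("c", 9)]
print(compact([newest, oldest]))  # [('a', 3), ('c', 9)]
```

Because inputs are already sorted, the merge is a streaming pass rather than a full re-sort, which is what keeps compaction I/O incremental.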
DTE: Distributed, performant, and highly consistent
Compared to common single-node regional databases with weaker consistency guarantees, Fauna provides superior real-world performance, higher availability in failure scenarios, and horizontal scaling without sacrificing transactional guarantees or relational features. And compared to newer databases designed for distributed workloads, Fauna does not require specialized hardware like atomic clocks, making it fully cloud-agnostic. Because of its optimistic concurrency controls and strictly serializable isolation guarantees, Fauna delivers a low-latency, highly consistent system regardless of transaction mix, avoiding expensive operations like traditional two-phase commits. Its layered architecture allows for a clean separation of concerns, improving its ability to autoscale. Many customers have experienced this “fire and forget” reliability for themselves; we encourage you to test Fauna and find out for yourself.