Introducing Automated Log Topology in FaunaDB 2.6
One of the most exciting features of the FaunaDB 2.6 release is the automated log topology. Prior to this release, setting up the log topology was a manual process where an enterprise operator had to decide on a log topology upfront, and then repeat that in the FaunaDB configuration file on each node within and across a replica. The operator also had to be educated on how to associate a log segment with a node and ensure consistency in configuration across all replicas in the cluster. Updating this configuration was not straightforward either. In this release, all the inefficiencies around the log setup go away. It is now completely automated and transparent to the operator. In the sections below, we discuss how log topology is set up and how transaction logging works in general.
Transactions in FaunaDB
FaunaDB guarantees fully ACID distributed transactions and offers strictly serializable operations across replicas. This means that the system can process many transactions in parallel, but the final result is equivalent to processing them in serial order. FaunaDB’s Calvin inspired transaction protocol decides the serial order prior to performing any writes to the database. Since transaction application is deterministic in FaunaDB, the system has to ensure every data node that applies writes to itself sees the same transactions in the same order. Each batch of transactions in an epoch is inserted into the transaction log. The FaunaDB execution engine ensures that the final result of processing every batch of transactions is equivalent to processing each transaction individually and in a serialized order. For more details on how transactions are processed, refer to the blog post on “Fauna Transaction Protocol”.
Each transaction in the log is associated with a unique timestamp that approximates real time. Unlike other globally consistent databases, real-time is not a central component of FaunaDB’s protocol. FaunaDB does not rely on global synchronization of clocks across servers. The notion of “before” and “after” is entirely dependent on the order in which transactions appear in the global log. This is important as log segments can skew a bit from one another with regard to their last committed epoch. This skew is inherent in a distributed system as the various data nodes in the cluster can be exposed to different loads and interruptions. To resolve this and preserve determinism, data nodes within a replica ensure that they have read transaction batches for the same not-yet-applied epoch from all log segments that participated in the consensus for that epoch. The transaction batches are then merged in the order of the numbered log segments and applied.
Transaction Log Architecture
In FaunaDB, transactions are durably written to a distributed log before they are applied. Each replica in the Fauna cluster can contain a copy of the log. A typical cluster will contain at least three replicas with copies of the log. A Raft-inspired protocol is used to achieve consensus among the log replicas. Odd numbers of replicas allow the consensus algorithm to achieve a quorum in the face of log replica disruption or failure.
Each replica of the log can contain one or more log segments. In these cases, there will be a one-to-one relationship between the log segments across the replicas. A very typical deployment would be a cluster with three replicas each containing three nodes. Each of the three replicas would contain a copy of the log and each log would be divided into three log segments as depicted below.
In clusters where some replicas do not contain copies of the log or have unbalanced node counts, there may be nodes that do not contain any log segments. The diagram below provides an idea of how diverse such a configuration can be. In this example, we have a cluster with four replicas. Only three of the replicas contain copies of the log. In those replicas, each contains three log segments. This is the maximum number of log segments possible as Region 1 and 3 only have three nodes thereby capping the maximum log segments to three.
Transactions in Fauna are not written to the log individually. Coordinators send transactions to log nodes which are then batched together in-memory and written to a log segment. This process occurs every 10 ms and we call it an epoch. The log nodes within a single segment achieve consensus on their contents every epoch. This is referred to as “commit” in Raft. At the point of the log consensus, all transactions are deterministic and durable.
In addition to the transaction logs, the system also maintains a global log at the cluster level to maintain the log topology state. The information in this topology state includes the currently active log segment IDs and the assignments of hosts that participate as log nodes in those segments, and the current epoch position for each segment.
Log Segment Status
As illustrated above, the nodes in a cluster need to be aware of the current state of the log topology to make sure the correct batches are applied for each epoch. FaunaDB keeps an independent Raft log that maintains the state of the log topology. Logs segments can be in one of several different states. When a log segment is first created, it is Uninitialized. The first log node in the list of its members will take it upon itself to initialize the segment's Raft log. Other members will try to join it. If the membership of the segment changes during this initialization process, it will be started over. Only once all initially known members have joined will the initializing node register it as Started in the log topology Raft state and assigned a start epoch. The start epoch is in the logical future so that when the data nodes reach that point they can all observe its existence in a consistent manner. Once the start epoch rolls around, the log nodes are free to start committing batches into the segment log.
When a segment needs to be closed, it is registered in topology Raft log as Closed with a record of its ending epoch. When the coordinators notice that the segment is in a closed state, they will cease to send it transactions. The log node itself will refuse any transactions sent to it and will also reject any transactions it has pending in memory. The leader for the segment will keep committing batches (which at this point will be empty) until its end epoch rolls around.
Log Segment Membership
While there'd be many ways to assign nodes to segments, the algorithm that FaunaDB internally follows can be imagined as orthogonally cross-cutting the replicas with segments. If we have three replicas (A, B, C) with five nodes in each (A1-A5, B1-B5 and C1-C5), then the system will allocate five segments, with each segment having a node in every replica: segment 1 will have A1, B1, C1 nodes, segment 2 will have A2, B2, C2 nodes, up to segment 5 having nodes A5, B5, C5. Segments are enlarged when a new replica is added (if we add a replica D with nodes D1-D5, then segment 1 will become A1, B1, C1, D1, etc.), and new segments are added when all replicas are enlarged: if we add nodes A6, B6, C6, D6 to every replica, then a new segment 6 will be created with them as members.
When a log segment is closed, it is closed on all its remaining nodes in other replicas too. Those nodes are freed up and can later participate in new log segments. As discussed above, a node does not have to have an active log segment to continue its roles as a coordinator and data node.
Every FaunaDB replica has a type describing the level of functionality its nodes will provide. The three types, in increasing order of functionality, are named compute, data, and data+log.
- Compute - nodes in compute replicas do not store data, nor do they participate in the distributed transaction log. They can receive queries from clients and execute them, relying on nodes in other replicas for data storage and the transaction log.
- Data - nodes in data replicas execute queries from clients just as compute replicas do, but in addition, they also store data. The full data set is partitioned between nodes in a single replica, and the replica as a whole will contain the full data set.
- Data+Log - nodes in data+log replicas have all the functionality of a data replica, but in addition, they also participate in the distributed transaction log.
The type of the replica can be set any number of times during its lifetime, as long as the cluster at all times has at least one data+log replica. When the cluster is initialized, the replica of the first node is by default set to be of the data+log type, while all additional replicas start out as compute replicas and need to be explicitly set to a different type in order to participate in either the data or log replication.
When a replica type is changed from data (or data+log) to compute, and the topology change passes safety checks to ensure it won’t lose data or compromise availability, it will immediately discard its full data set and if it is later reclassified as data, it will need to re-acquire data from other replicas, which can take some time. Data replicas that are acquiring data become gradually available for both reads and writes (and are thus useful data members of the cluster) as their data acquisition process progresses.
It is recommended that the cluster should have at least 3 data+log replicas in order to ensure the cluster's ability to process writes in the event that one replica goes down. Write latency is affected by the physical location of the nodes which replicate the transaction log. Writes originating closer to the nodes replicating the transaction log will have lower latency than writes originating from farther away.
Useful commands for log management
- To review the configuration of the cluster issue the following command
- To set a replica to ‘data+log’ issue the command using fauna-admin on any of the nodes in that replica
$faunadb-admin update-replica data+log replica2
- You may also specify multiple replicas in the same command
$faunadb-admin update-replica data+log replica2 replica3
- In order to change a replica to a non-log role simply update the replica using the following command
$faunadb-admin update-replica compute replica2
If you enjoyed this topic and want to work on systems and challenges just like this, Fauna is hiring!