FaunaDB Outage Review
As software engineers, we all know too well that the question is not if you’ll hit a bug or have an outage, but when. However, as a database system provider, our responsibilities in this area are higher than most as our downtime becomes your downtime.
From 9/23 to 10/17, FaunaDB experienced a series of issues that impacted our users and I want to take some time to walk through them, how we dealt with them, and our steps to prevent them from reoccurring. I also want to touch briefly on our testing process so that you can get a sense of how serious we take our responsibility to you.
One of the first issues that arose in this period of instability was what we now refer to as "election storms". To understand this bug it is important to first understand a bit about FaunaDB. FaunaDB requires there to be a globally shared log to track transactions. We segment this log by the number of nodes in each replica to share the load. Each segment uses RAFT to manage consensus about the log segment across replicas.
When all is well in the system, a leader is elected in each RAFT ring and transactions flow into their various log segments. However, if an election is called because, for example, the node with the leader goes down, there is a subsecond pause in the flow of transactions while a new leader is elected. In a properly functioning system, this election is unnoticeable.
In the case of this incident, there was one replica that had two replicas connected to it on connections with very similar latency. Due to the network topology of the system, the one replica was in the middle of the other two. Occasionally a new leader election would be called, and the outer nodes would end up repeatedly calling elections due to a very narrow window where the nodes on the edge would think they were elected, and then get a request for an election from the far node. This pattern could loop effectively indefinitely. Eventually, natural fluctuations in latency would cause an election to stick and a leader was chosen, but during that "storm", no transactions could be processed.
One of the stats we track in the system is the number of elections being called as it is an indicator of overall system health. We were able to see that elections were spiking and identified that restarting one of the involved nodes rectified the issue, but it took significant debugging and research of the related logs to figure out the cause of the issue.
As is frequently the case with these sorts of distributed consensus algorithms, the solution was to add some variance to the call for an election time such that any cycle of elections would be more quickly broken. Being this was a fundamental change to the system, the engineering team moved cautiously, restarting nodes involved in the election storms as necessary keep the system operating. This fix was rolled out to production in version 2.9.1 and no further election storm events have been seen.
In the midst of triaging these election storms, a separate issue occurred which caused two additional outages on 10/9 and 10/10. With the Calvin algorithm, there is a transaction log that all replicas write to. Transactions are written to the log in a globally ordered fashion; as soon as a quorum of nodes acknowledge that a transaction has been written to their logs and at least one node writes the transaction to storage, a response can be given back to the requestor. Given that all of the replicas have the same data, as long as they apply the transactions in the same order we can guarantee that the transaction will be replayed the same way throughout the system.
It is important to note that with FaunaDB the entry is written to the log and then separately executed and stored. Being a deterministic transactional system, transactions must be applied in order.
With the first incident, a transaction made it into the log that could be written to the log, but then could not be applied to storage. We have guards in place to reject transactions that cannot be applied to storage, but in this case, the data was in the system from a time before we had these guards and was picked up by a new index building task. Unfortunately, even with the large amounts of introspection we have in the system, it was not clear at first blush what was occurring and took a while to triage and develop a remediation.
With the second incident, a malformed transaction was introduced to the write log by an internal administrative tool. The method that caused the write to fail was different from the one from the first incident, and it required using the remeditation technique two times to get the system to return to normal.
Since then we have hardened the system in a number of ways to prevent this category of problems. In 2.9.1 (currently live in production) we put guards in place to defend against existing oversized data from blocking the transaction pipeline. We've also added admin commands to make it easy for the on-call engineer to remove a blocking transaction. In our upcoming release, we will add functionality to automatically quarantine a write that fails to apply so that the transaction log will continue to flow. The failure to apply in all three cases was a side effect of constraints in our storage layer. In addition to the measures above, we are also researching new storage engines that will be more flexible than our current one.
Latency Spikes in SA and Singapore
The final incident in the series of issues was a large spike in latency for users in South America and Asia on 10/18. The on-call team began looking into the regions involved and saw that the connection latency between our new compute-only nodes in South America and Asia and the rest of the cluster had sky-rocketed.
Compute-only nodes are nodes that are configured to process user requests, but do not have their own copy of the data, so they must request data to calculate a transaction from the rest of the cluster. The nodes were set up this way to prevent user data from being stored outside of areas protected by GDPR and US Privacy Shield.
The team ran test queries from the impacted regions and saw that queries routed over the public internet to the data replicas in North America were faster than queries routed through our compute-only replicas. With no local data, these compute-only nodes had to make multiple internal read requests over high latency connections to data nodes in North America. In comparison, with queries routed directly to the North America replicas, the initial connection had high latency, but the internal read requests were over extremely low latency connections. Unfortunately this undermined the expected benefit of local compute nodes for the new regions.
We have decided to temporarily decommission the compute-only nodes until we can bring them back as full data replicas.
How We Test
It would be well worth a full article on our testing process, but I think it is important to at least briefly talk about it to give some insight into the process.
We currently have four test suites that we run:
- Single machine test suite - this is a suite of tests that can be run on a developer's machine. Some folks might call these smoke tests as we use them to validate pull requests, but they take quite a bit longer to run than a standard unit or smoke test suite. These tests stand up a FaunaDB cluster and run a variety of tests. There are also thousands of unit tests of various subsystems as a part of this process.
- Regression suite - this is a battery of tests that operate against a cluster running on multiple machines. This test suite runs for about eight hours and executes a number of tests with controlled replica and node failures. Due to the long run time, this suite is run nightly. It can be thought of as a preprogrammed "chaos monkey", causing node and replica failures in a pattern calculated by the dev team to make sure failure cases that could impact availability or correctness are caught.
- Jepsen - since the release of our Jepsen report in early 2019, we have run Jepsen on a regular basis to validate that we are maintaining our correctness claims as we make changes to fundamental systems like the RAFT changes mentioned above. Jepsen immensely stresses database systems and has been a good method to validate our system. We also use it as a soak tester in some cases as it can be run continuously.
- YCSB - we use YCSB as a primary performance test framework as it is widely understood in the industry. YCSB is not meant for distributed databases, but we have built custom workloads to be able to stress FaunaDB in more interesting ways. YCSB is also run nightly so that we can catch performance regressions more easily.
We also have a very robust code review process to ensure all FaunaDB changes are thoroughly reviewed and vetted. Two reviewers are required to approve any commit to master.
We have now been running with version 2.9.1 for a couple of weeks and have had no incidents related to FaunaDB since its release. It was important to us to make sure that we were confident in our fixes before we came back to our community to say the issues were addressed. In the future, we would like to get the cycle time from incident to this sort of writeup to be shorter, but we will tend to err on the side of caution; quick bandage fixes are not the way to maintain and grow a database system like FaunaDB.
As part of the incidents above we have also refined our incident response process and replaced our former status page with an industry standard which is both easier for us to use and easier for you to consume. Those status updates are also published to our @faunastatus twitter account and to #cloud-status in our community Slack.
In addition to the fixes mentioned above, there is a lot of research going on to minimize latency spikes in the system and improve FaunaDB’s performance and operational stability. The fixes mentioned above are just part of the work that we're doing to make sure FaunaDB is the stable, globally available system you need.
If you have any questions about anything from this post, or about our processes in general, please feel free to reach out to me at firstname.lastname@example.org or in our community Slack and I will be happy to answer them. From the engineering team, we highly appreciate your support during the incidents and are hard at work to ensure that data storage is the easiest part of building your applications.
As you can tell from the incidents above, building a global deterministic transactional database brings an endless number of interesting problems to solve. If you think these sound like the sort of challenges you’d like to tackle, we’re hiring!