May 03, 2019


Performance Preview of FaunaDB 2.7 with YCSB

FaunaDB is a distributed operational database that guarantees data correctness, without operational complexity, at a global scale. At Fauna, horizontal and vertical scalability are both very important to us, since they improve not only performance, but also operational flexibility.

One area we have been spending time on is multi-core scalability. Historically, many database systems could scale to faster cores, but not to more cores, due to contention on shared local resources like locks. But CPU parallelism, whether discrete cores or hyperthreads, is increasingly common in the cloud era and we need to make sure FaunaDB can take full advantage of it. 

This post compares FaunaDB 2.7 to 2.6.3. We are happy to report an approximately 25% across-the-board improvement in both read and write throughput regardless of core count, and in particular, up to roughly 2.4x the read throughput on higher core-count machines. And of course, because of its Calvin transaction protocol, FaunaDB maintains these performance improvements even at global replication latencies.

The benchmarks below capture a single dimension of FaunaDB’s overall performance profile; we look forward to exploring many other dimensions in subsequent blog posts.

Setup

The purpose of this test was to estimate how FaunaDB’s throughput responds to CPU parallelism. Three separate YCSB workloads were executed on three different vCPU configurations on a cluster running on the Google Cloud Platform (GCP). The cluster configuration, a single replica with 3 nodes, remained constant throughout the tests. Only the number of vCPU cores on each node was changed before each run.

| Configuration | Description |
| --- | --- |
| Cluster | 1 replica with 3 nodes |
| vCPUs | 4, 8, 16 virtual cores @ 2.30 GHz |
| Memory | 26 GB |
| Storage | 2 TB of locally attached SSD |
| YCSB Workloads | Workload A: 50% read, 50% write; Workload B: 95% read, 5% write; Workload C: 100% read |
| Java | JDK 11 |
| Hugepages | Enabled |
| OS Version | CentOS 7 |
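For reference, the three workload mixes above are YCSB’s standard core workloads A, B, and C. The sketch below shows one plausible way to parameterize them; the property names are standard YCSB CoreWorkload options, while the record count and record sizing are assumptions (10 fields of 100 bytes matches the 1 KB records described below, but the exact values are not stated in this post).

```python
# Hypothetical parameterization of the three YCSB core workloads used here.
# Property names are standard YCSB CoreWorkload options; the sizing values
# are assumptions chosen to match the 1 KB record size described below.
COMMON = {
    "workload": "com.yahoo.ycsb.workloads.CoreWorkload",
    "recordcount": 1_000_000,          # assumed; not stated in the post
    "fieldcount": 10,                  # 10 fields x 100 bytes ~= 1 KB per record
    "fieldlength": 100,
    "requestdistribution": "zipfian",  # YCSB default for workloads A, B, and C
}

WORKLOAD_MIXES = {
    "A": {"readproportion": 0.50, "updateproportion": 0.50},  # 50% read / 50% update
    "B": {"readproportion": 0.95, "updateproportion": 0.05},  # 95% read / 5% update
    "C": {"readproportion": 1.00, "updateproportion": 0.00},  # 100% read
}
```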


For each test run, the following procedure was followed:

  1. Set up the FaunaDB cluster with the specific node configuration
  2. Initialize the FaunaDB cluster
  3. Create a database and schema for the test
  4. Populate the database with test data
  5. Run the workload

Each scenario was run repeatedly, increasing the number of concurrent clients each time. At lower core counts, more client threads would eventually lead to errors when the servers became saturated. Even though additional throughput was likely available, only client concurrency levels that achieved 100% success rates have been included in the results.

In each of the workload scenarios, the client threads either read or updated records of 1 KB each in a single transaction. No other userspace processes were running on the servers. Each test lasted between 3 and 8 minutes. In the 4- and 8-core setups, the number of client threads ranged from 300 to 2,000; in the 16-core setup, with more headroom available, the number of threads ranged from 300 to 4,000.
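As an illustration of this concurrency sweep, a minimal driver might look like the sketch below. The `-P`, `-threads`, and `-p` flags are standard YCSB command-line options; the `faunadb` binding name and the operation count are assumptions, since the post does not specify the exact client binding or run parameters.

```python
# Minimal sketch of the concurrency sweep: run the same YCSB workload
# repeatedly while increasing the number of client threads.
import subprocess

THREAD_STEPS = [300, 500, 1000, 1500, 2000, 3000, 4000]  # 3,000+ used only on 16 cores

def run_workload(workload_file: str, threads: int) -> None:
    cmd = [
        "bin/ycsb", "run", "faunadb",    # binding name is an assumption
        "-P", workload_file,             # e.g. workloads/workloada
        "-threads", str(threads),
        "-p", "operationcount=1000000",  # assumed; not stated in the post
    ]
    subprocess.run(cmd, check=True)

for threads in THREAD_STEPS:
    run_workload("workloads/workloada", threads)
```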

Observations

The following tables enumerate the operations per second, the number of client threads, and the vCPU parallelism for the three workload scenarios tested.

Workload A (50% Read, 50% Update)

This workload simulates equal proportions of reads and writes. Throughput ranged from roughly 3,000 operations per second (4 cores, 300 client threads) to 15,500 operations per second (16 cores, 4,000 client threads). At 2,000 client threads, throughput scaled effectively with cores, although not quite linearly. Writes are by their nature more contended than reads, so this is expected, although we will continue to improve it over time.

Workload A (FaunaDB 2.7)

| Threads | Ops/Sec (4 cores) | Avg Runtime (4 cores) | Ops/Sec (8 cores) | Avg Runtime (8 cores) | Ops/Sec (16 cores) | Avg Runtime (16 cores) |
| --- | --- | --- | --- | --- | --- | --- |
| 300 | 3,069 | 5 m 26 s | 4,249 | 7 m 51 s | 5,200 | 6 m 25 s |
| 500 | 3,594 | 4 m 38 s | 5,508 | 6 m 3 s | 7,557 | 4 m 25 s |
| 1,000 | 3,974 | 4 m 12 s | 7,148 | 4 m 40 s | 9,927 | 3 m 21 s |
| 1,500 | 4,256 | 3 m 55 s | 8,009 | 4 m 10 s | 11,257 | 2 m 58 s |
| 2,000 | 4,435 | 3 m 45 s | 8,638 | 3 m 52 s | 12,757 | 2 m 37 s |
| 3,000 | NA | NA | NA | NA | 14,136 | 2 m 21 s |
| 4,000 | NA | NA | NA | NA | 15,578 | 2 m 8 s |


Charting the above throughput shows semi-linear scalability with increasing threads and CPU cores.

Throughput improvement in 2.7 compared to 2.6.3 for Workload A ranges from 16% to 28% depending on the configuration, i.e., the number of CPU cores per server. The following table shows the throughput from the same workload running on release 2.6.3; a quick calculation reproducing the quoted range follows the table.

Workload A (FaunaDB 2.6.3)

| Threads | Ops/Sec (4 cores) | Avg Runtime (4 cores) | Ops/Sec (8 cores) | Avg Runtime (8 cores) | Ops/Sec (16 cores) | Avg Runtime (16 cores) |
| --- | --- | --- | --- | --- | --- | --- |
| 300 | 2,748 | 6 m 4 s | 3,490 | 5 m 17 s | 6,293 | 4 m 49 s |
| 500 | 3,157 | 5 m 17 s | 4,041 | 5 m 3 s | 7,092 | 4 m 32 s |
| 1,000 | 3,602 | 4 m 38 s | 4,901 | 4 m 58 s | 8,293 | 4 m 3 s |
| 1,500 | 3,774 | 4 m 25 s | 5,755 | 4 m 49 s | 8,896 | 3 m 59 s |
| 2,000 | 3,798 | 4 m 18 s | 5,777 | 4 m 37 s | 9,923 | 3 m 48 s |
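As a quick sanity check on the 16% to 28% range quoted above, the per-configuration improvement can be computed directly from the two tables. The sketch below uses the 2,000-client-thread rows for the 4-core and 16-core configurations, which reproduce the low and high ends of the quoted range.

```python
# Relative throughput improvement of 2.7 over 2.6.3 for Workload A,
# taken from the 2,000-client-thread rows of the two tables above.
ops_27 = {"4 cores": 4_435, "16 cores": 12_757}
ops_263 = {"4 cores": 3_798, "16 cores": 9_923}

for config, new_ops in ops_27.items():
    old_ops = ops_263[config]
    gain = (new_ops - old_ops) / old_ops
    print(f"{config}: +{gain:.1%}")  # 4 cores: +16.8%, 16 cores: +28.6%
```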

Workload B (95% Read, 5% Update)

This is a read-heavy workload with 5% writes. The maximum observed throughput was roughly 31,000 operations per second at 4,000 client threads. Here, due to the emphasis on reads, we start to see very close to linear scalability across cores.

Workload B (FaunaDB 2.7)

| Threads | Ops/Sec (4 cores) | Avg Runtime (4 cores) | Ops/Sec (8 cores) | Avg Runtime (8 cores) | Ops/Sec (16 cores) | Avg Runtime (16 cores) |
| --- | --- | --- | --- | --- | --- | --- |
| 300 | 7,120 | 7 m 1 s | 13,209 | 5 m 3 s | 17,492 | 3 m 49 s |
| 500 | 7,547 | 6 m 38 s | 15,100 | 4 m 25 s | 21,320 | 3 m 8 s |
| 1,000 | 8,210 | 6 m 5 s | 16,391 | 4 m 4 s | 26,556 | 2 m 31 s |
| 1,500 | 8,637 | 5 m 47 s | 16,997 | 3 m 55 s | 28,632 | 2 m 20 s |
| 2,000 | 8,675 | 5 m 46 s | 17,332 | 3 m 51 s | 29,241 | 2 m 17 s |
| 3,000 | NA | NA | NA | NA | 30,288 | 2 m 12 s |
| 4,000 | NA | NA | NA | NA | 30,830 | 2 m 10 s |

Just as we saw in Workload A, a quick glance at the 2.6.3 Workload B numbers below reveals a similar 25% improvement across the various tested configurations, with even more dramatic improvements on the 16-core machines.

Workload B (FaunaDB 2.6.3)

| Threads | Ops/Sec (4 cores) | Avg Runtime (4 cores) | Ops/Sec (8 cores) | Avg Runtime (8 cores) | Ops/Sec (16 cores) | Avg Runtime (16 cores) |
| --- | --- | --- | --- | --- | --- | --- |
| 300 | 5,892 | 8 m 29 s | 8,249 | 7 m 37 s | 10,476 | 5 m 53 s |
| 500 | 6,219 | 8 m 2 s | 9,329 | 6 m 51 s | 12,874 | 5 m 25 s |
| 1,000 | 6,639 | 7 m 32 s | 10,091 | 6 m 5 s | 13,623 | 4 m 49 s |
| 1,500 | 6,406 | 7 m 48 s | 9,993 | 5 m 49 s | 12,991 | 3 m 59 s |
| 2,000 | 7,078 | 7 m 4 s | 10,617 | 5 m 43 s | 13,908 | 3 m 51 s |

Workload C (100% Read)

This is a read-only workload. The maximum throughput was roughly 39,500 operations per second at 4,000 client threads. The results are similar to Workload B, with strong scaling as cores increase.

Workload C (FaunaDB 2.7)

| Threads | Ops/Sec (4 cores) | Avg Runtime (4 cores) | Ops/Sec (8 cores) | Avg Runtime (8 cores) | Ops/Sec (16 cores) | Avg Runtime (16 cores) |
| --- | --- | --- | --- | --- | --- | --- |
| 300 | 9,166 | 18 m 11 s | 20,944 | 7 m 51 s | 27,205 | 6 m 8 s |
| 500 | 9,829 | 16 m 57 s | 22,831 | 6 m 3 s | 33,550 | 4 m 58 s |
| 1,000 | 10,484 | 15 m 54 s | 24,236 | 4 m 40 s | 37,077 | 4 m 30 s |
| 1,500 | 10,555 | 15 m 47 s | 24,022 | 4 m 10 s | 39,054 | 4 m 16 s |
| 2,000 | 10,814 | 15 m 25 s | 25,473 | 3 m 52 s | 39,255 | 4 m 15 s |
| 3,000 | NA | NA | NA | NA | 39,515 | 4 m 13 s |
| 4,000 | NA | NA | NA | NA | 39,532 | 4 m 13 s |

For the read-only Workload C, the improvement compared to 2.6.3 is similar at lower core parallelism, but at 16 cores we see very substantial gains: FaunaDB 2.7 delivers roughly 2.4x the throughput of 2.6.3 (39,255 vs. 16,392 operations per second at 2,000 client threads).

Workload C (FaunaDB 2.6.3)

| Threads | Ops/Sec (4 cores) | Avg Runtime (4 cores) | Ops/Sec (8 cores) | Avg Runtime (8 cores) | Ops/Sec (16 cores) | Avg Runtime (16 cores) |
| --- | --- | --- | --- | --- | --- | --- |
| 300 | 5,863 | 28 m 26 s | 8,795 | 16 m 37 s | 11,961 | 13 m 53 s |
| 500 | 6,786 | 24 m 34 s | 10,111 | 15 m 1 s | 14,155 | 12 m 20 s |
| 1,000 | 6,979 | 23 m 53 s | 10,399 | 14 m 43 s | 13,519 | 12 m 1 s |
| 1,500 | 7,380 | 22 m 35 s | 10,996 | 14 m 3 s | 15,614 | 11 m 32 s |
| 2,000 | 7,982 | 20 m 53 s | 12,053 | 13 m 47 s | 16,392 | 11 m 1 s |

Conclusions

From the above test results, we can conclude that:

  • A very modestly sized cluster of 3 commodity cloud compute instances is capable of meeting the throughput demands of most transactional workloads, while ensuring predictable performance.

  • FaunaDB distributes workloads evenly among all nodes within the replica and can meet increased throughput needs, especially for reads, simply by provisioning more cores. 

FaunaDB operators can start small, and as applications grow, they can scale their cluster by adding new nodes to existing replicas or adding new replicas in neighboring zones or regions. But if availability goals are already met, with FaunaDB 2.7, operators can also simply increase available cores, which can be done without any service interruption due to FaunaDB's distributed nature. 

Of course, for geographically distributed workloads, we recommend distributing multiple replicas across regions. We look forward to benchmarking this aggressively in upcoming posts.

If you enjoyed this topic and want to work on systems and challenges just like this, Fauna is hiring!