Benchmarking Go Redis server libraries
There has been a lot of discussion on Twitter and in the Go performance Slack channel recently about the performance of Go Redis protocol libraries. These libraries give you the ability to build a service of your own that speaks the Redis protocol. Two libraries seem to be of particular interest in the community: Redcon and Redeo.
I needed to support the Redis protocol in a soon-to-be-announced project, and I thought it might be useful to others if I published the benchmark findings from my evaluation of these libraries.
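To give a sense of what these libraries provide, here is a minimal sketch of a Redis-protocol server built on Redcon's public API (github.com/tidwall/redcon). It handles only PING, SET, and GET with a single mutex-protected map, which is not how the benchmark servers below are implemented; treat it as an illustration of the handler model, not the code under test.

```go
package main

import (
	"log"
	"strings"
	"sync"

	"github.com/tidwall/redcon"
)

func main() {
	// A single map behind one mutex keeps the sketch short; the benchmark
	// servers described under Software setup shard the map instead.
	var mu sync.RWMutex
	store := map[string][]byte{}

	err := redcon.ListenAndServe(":6380",
		func(conn redcon.Conn, cmd redcon.Command) {
			switch strings.ToLower(string(cmd.Args[0])) {
			case "ping":
				conn.WriteString("PONG")
			case "set":
				if len(cmd.Args) != 3 {
					conn.WriteError("ERR wrong number of arguments for 'set'")
					return
				}
				mu.Lock()
				// Copy the value as a precaution in case the library reuses its read buffer.
				store[string(cmd.Args[1])] = append([]byte(nil), cmd.Args[2]...)
				mu.Unlock()
				conn.WriteString("OK")
			case "get":
				if len(cmd.Args) != 2 {
					conn.WriteError("ERR wrong number of arguments for 'get'")
					return
				}
				mu.RLock()
				v, ok := store[string(cmd.Args[1])]
				mu.RUnlock()
				if !ok {
					conn.WriteNull()
					return
				}
				conn.WriteBulk(v)
			default:
				conn.WriteError("ERR unknown command '" + string(cmd.Args[0]) + "'")
			}
		},
		// Accept every connection; nothing special to do when one closes.
		func(conn redcon.Conn) bool { return true },
		func(conn redcon.Conn, err error) {},
	)
	if err != nil {
		log.Fatal(err)
	}
}
```

Redeo offers a similar command-handler model; the details differ but the shape of the server is the same.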
Hardware setup
- Client and servers are on independent machines.
- Both systems have 20 physical CPU cores.
- Both systems have 128GB of memory.
- This is not a localhost test. The network between the two machines is 2x bonded 10GbE.
Software setup
The Go in-memory Map implementations for Redcon and Redeo are sharded, but each shard is protected by a write lock. I've written a tool called Tantrum to aid in automating the benchmark runs and visualizing the results. With the help of a script and Tantrum, I've benchmarked various configurations of concurrent clients and pipelined requests to see how different workloads affect performance.
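As a rough illustration of that sharding approach (the shard count, hash function, and API here are my own choices, not Redcon's or Redeo's actual code), a sharded map looks something like this:

```go
package main

import (
	"hash/fnv"
	"sync"
)

// shardCount is an arbitrary choice for this sketch.
const shardCount = 256

type shard struct {
	mu   sync.RWMutex
	data map[string][]byte
}

// ShardedMap spreads keys across shards so writers contend only for
// one shard's lock instead of a single global lock.
type ShardedMap struct {
	shards [shardCount]*shard
}

func NewShardedMap() *ShardedMap {
	m := &ShardedMap{}
	for i := range m.shards {
		m.shards[i] = &shard{data: make(map[string][]byte)}
	}
	return m
}

func (m *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return m.shards[h.Sum32()%shardCount]
}

func (m *ShardedMap) Set(key string, value []byte) {
	s := m.shardFor(key)
	s.mu.Lock()
	s.data[key] = value
	s.mu.Unlock()
}

func (m *ShardedMap) Get(key string) ([]byte, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	v, ok := s.data[key]
	s.mu.RUnlock()
	return v, ok
}
```

Spreading keys across shards means concurrent writers mostly contend on different locks, which is part of what lets these servers use more than one core for SET-heavy workloads.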
Redis: All disk persistence is turned off. I wrote a version of redis-benchmark that supports microsecond resolution instead of millisecond resolution, because we were losing a lot of fidelity in some of the results.
Go garbage collector
The Go garbage collector has received some nice performance improvements as of late, but the Go 1.7 GC still struggles with larger heaps. This can surface with one or more large Map instances. You can read the details here. Luckily, there's a fix in master that reduces GC pauses in these cases by 10x, turning a 900ms pause into a 90ms one, which is a great improvement. I've decided to benchmark against this fix because it will likely ship in Go 1.8.
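If you want to observe pause behavior on your own hardware, the standard library exposes GC pause history; here's a small hand-rolled sketch (the map size is arbitrary and this is not the reproduction case from the linked issue). Running with GODEBUG=gctrace=1 prints similar per-cycle pause information.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Build a large map of pointer-containing values so the GC has a big
	// heap to scan; the size is arbitrary for this sketch.
	m := make(map[int64][]byte, 1<<24)
	for i := int64(0); i < 1<<24; i++ {
		m[i] = make([]byte, 16)
	}

	runtime.GC() // force a collection so the stats below reflect this heap

	var stats debug.GCStats
	debug.ReadGCStats(&stats)
	// Pause holds recent stop-the-world pause durations, most recent first.
	fmt.Printf("GC cycles: %d, total pause: %v, last pause: %v\n",
		stats.NumGC, stats.PauseTotal, stats.Pause[0])

	runtime.KeepAlive(m)
}
```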
CPU efficiency
----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---most-expensive---
time |usr sys idl wai hiq siq| read writ| recv send| block i/o process
07-10 03:39:01| 4 1 94 0 0 1| 0 0 | 56M 8548k|
07-10 03:39:02| 4 1 94 0 0 1| 0 0 | 56M 8539k|
07-10 03:39:03| 4 1 94 0 0 1| 0 0 | 56M 8553k|
Figure 1: Redis CPU usage during 128 connection / 32 pipelined request benchmark.
As shown in Figure 1, Redis used fewer CPU resources, but its single-threaded design limits its ability to fully utilize all of the CPU cores.
----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---most-expensive---
time |usr sys idl wai hiq siq| read writ| recv send| block i/o process
07-10 03:52:21| 35 11 51 1 0 1| 0 0 | 57M 8701k|
07-10 03:52:22| 35 12 51 1 0 1| 0 0 | 56M 8585k|
07-10 03:52:23| 33 12 52 2 0 1| 0 0 | 56M 8636k|
Figure 2: Redcon and Redeo CPU usage during 128 connection / 32 pipelined request benchmark.
As shown in Figure 2, Redcon and Redeo both utilized multiple CPU cores better than Redis and allow higher throughput per process, though they do not use the CPU as efficiently as Redis. This means that one Redcon or Redeo process can outperform one Redis process; however, if you ran multiple Redis processes you would see higher total throughput than with Redcon or Redeo (at the cost of deployment complexity).
This is a hyperthreaded machine, which means 50% (usr + sys) usage indicates near CPU saturation. In other words, the lack of free CPU cycles is getting in the way of greater throughput. I'm also concerned that Figure 2 shows IOWAIT delays.
Benchmark passes
The combinations of the following configurations were used to record a total of 63 benchmark runs.
- Connections: 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384
- Pipelined requests: 1, 4, 8, 16, 32, 64, 128
In the production environments I've seen, it's not uncommon to have thousands or tens of thousands of client connections to a single Redis instance. Each benchmark run lasts 10 minutes per service, for a total of 30 minutes of recorded results (approximately 35 minutes with results processing).
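For illustration, here is a sketch of the kind of driver loop a script could use to walk the 9 x 7 matrix with a stock redis-benchmark. The real runs use my microsecond-resolution build and are orchestrated by Tantrum, which isn't shown here; the server address and request count below are placeholders.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	connections := []int{64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384}
	pipelines := []int{1, 4, 8, 16, 32, 64, 128}

	// 9 connection counts x 7 pipeline depths = 63 runs per service.
	for _, c := range connections {
		for _, p := range pipelines {
			// Stock redis-benchmark flags: -c clients, -P pipeline depth,
			// -t which tests to run, -n total requests.
			cmd := exec.Command("redis-benchmark",
				"-h", "192.0.2.10", // placeholder server address
				"-t", "set",
				"-c", fmt.Sprint(c),
				"-P", fmt.Sprint(p),
				"-n", "10000000")
			out, err := cmd.CombinedOutput()
			if err != nil {
				fmt.Printf("run c=%d P=%d failed: %v\n", c, p, err)
				continue
			}
			fmt.Printf("c=%d P=%d\n%s\n", c, p, out)
		}
	}
}
```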
There were 2 passes of those 63 benchmark runs recorded. Each pass is 32 hours long, for a total of 64 hours of recorded benchmarking (74 hours total including processing).
- First pass: Redis, Redcon, and Redeo are freshly restarted processes, but warmed up before the benchmark begins.
- Second pass: The Redis, Redcon, and Redeo services have been running for over a week, having run over 80 hours of benchmarking. This will tell us how a longer-running process performs.
Summary of results
There isn't much difference between the first and second pass results beyond what I would consider normal variance. Both show throughput of 1.2 million SET requests/second in the same runs, with comparable latency characteristics.
Overall, Redis has a more predictable latency distribution, with lower outliers in the 99-100% range, than Redcon or Redeo. Redcon and Redeo have higher throughput with lower latency throughout the 0%-99% range, but exhibit higher and sometimes excessive outliers in the 99%-100% range. I suspect this is due to GC pauses.
As client connections increase, the single-threaded design of Redis starts to show signs of weakness. Redcon performs well at higher connection counts thanks to its ability to use more CPU cores, but it does start to hit some limitations that show up in response latency as the CPU gets saturated. Redeo is unpredictable: sometimes it has great response latency at throughput comparable to Redis, and sometimes it has poor response latency with the lowest throughput.
In many cases Redcon outperforms Redis in both throughput and latency. At times it can provide 2x the throughput at 50% lower latency throughout most of the percentiles up until the 99% mark. This isn't surprising, since it isn't single-threaded.
Redis does very well with what it has available on a single thread. Sharding across multiple Redis instances will yield great results, at a complexity cost. While Redeo and, even more so, Redcon benefit from a multi-threaded design and can sometimes handily outperform Redis, they still fall short of maximizing the CPU resources available. At best they deliver 2x higher throughput or 50% lower latency while having over 20x more CPU resources available. My opinion is that there are some CPU bottlenecks in the code that need some work. With a multi-threaded design and pipelining they should be able to achieve network saturation, but they aren't anywhere near that because they are saturating the CPU.