Simon Guindon

Large Scale Distributed Systems


Benchmarking Go Redis server libraries

October 24, 2016

There has been a lot of discussion on Twitter and in the Go performance Slack channel recently about the performance of Go Redis protocol libraries. These libraries let you build a service of your own that speaks the Redis protocol. Two libraries seem to be of particular interest in the community: Redcon and Redeo.

I needed to support the Redis protocol in a soon-to-be-announced project, and I thought it might be useful to others if I published the benchmark findings from my evaluation of these libraries.

Hardware setup

  • Client and servers are on independent machines.
  • Both systems have 20 physical CPU cores.
  • Both systems have 128GB of memory.
  • This is not a localhost test. The network between the two machines is 2x bonded 10GbE.

Software setup

The in-memory Go map implementations behind Redcon and Redeo are sharded, but each shard is protected by a writer lock. I’ve written a tool called Tantrum to help automate the benchmark runs and visualize the results. With the help of a script and Tantrum, I’ve benchmarked various configurations of concurrent clients and pipelined requests to see how different workloads affect performance.
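To make the sharding idea concrete, here is a minimal sketch of a sharded map with one lock per shard. It is not the code used in the benchmark: the shard count, the FNV-1a hash, the use of sync.RWMutex and every name below are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardCount is a hypothetical value, not taken from the benchmark.
const shardCount = 32

// shard is one slice of the key space, guarded by its own lock.
type shard struct {
	mu   sync.RWMutex
	data map[string][]byte
}

// ShardedMap spreads keys across shards so that writers touching
// different shards do not block each other.
type ShardedMap struct {
	shards [shardCount]*shard
}

// NewShardedMap allocates every shard up front.
func NewShardedMap() *ShardedMap {
	m := &ShardedMap{}
	for i := range m.shards {
		m.shards[i] = &shard{data: make(map[string][]byte)}
	}
	return m
}

// shardFor hashes the key with FNV-1a to pick a shard.
func (m *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return m.shards[h.Sum32()%shardCount]
}

// Set takes the shard's write lock, so writes to the same shard serialize.
func (m *ShardedMap) Set(key string, value []byte) {
	s := m.shardFor(key)
	s.mu.Lock()
	s.data[key] = value
	s.mu.Unlock()
}

// Get takes only the shard's read lock.
func (m *ShardedMap) Get(key string) ([]byte, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	v, ok := s.data[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	m := NewShardedMap()
	m.Set("foo", []byte("bar"))
	v, ok := m.Get("foo")
	fmt.Println(string(v), ok)
}
```

With a layout like this, a SET only contends with other requests that hash to the same shard, which is why the number of shards matters once many connections write concurrently.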

Redis: All disk persistence is turned off. I wrote a version of redis-benchmark that supports microsecond resolution instead of millisecond resolution because we were losing a lot of fidelity in some of the results.

Go garbage collector

The Go garbage collector has received some nice performance improvements lately, but the Go 1.7 GC still struggles with larger heaps. This can surface with one or more large Map instances. You can read the details here. Luckily, master has a fix that reduces GC pauses in these cases by 10x, which can bring a 900ms pause down to 90ms, a great improvement. I’ve decided to benchmark against this fix because it will likely ship in Go 1.8.

CPU efficiency

----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---most-expensive---
     time     |usr sys idl wai hiq siq| read  writ| recv  send|  block i/o process
07-10 03:39:01|  4   1  94   0   0   1|   0     0 |  56M 8548k|
07-10 03:39:02|  4   1  94   0   0   1|   0     0 |  56M 8539k|
07-10 03:39:03|  4   1  94   0   0   1|   0     0 |  56M 8553k|

Figure 1: Redis CPU usage during 128 connection / 32 pipelined request benchmark.

As shown in Figure 1, Redis used fewer CPU resources, but its single-threaded design limits its ability to use all the CPU cores.

----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---most-expensive---
     time     |usr sys idl wai hiq siq| read  writ| recv  send|  block i/o process
07-10 03:52:21| 35  11  51   1   0   1|   0     0 |  57M 8701k|
07-10 03:52:22| 35  12  51   1   0   1|   0     0 |  56M 8585k|
07-10 03:52:23| 33  12  52   2   0   1|   0     0 |  56M 8636k|

Figure 2: Redcon and Redeo CPU usage during 128 connection / 32 pipelined request benchmark.

As shown in Figure 2, Redcon and Redeo both make better use of multiple CPU cores than Redis and allow higher throughput per process, though not as efficiently as Redis. This means that one Redcon or Redeo process can outperform one Redis process. However, if you ran multiple Redis processes you would get higher throughput than Redcon or Redeo, at the cost of deployment complexity.

This is a hyper-threaded machine, so the 20 physical cores appear as 40 logical CPUs and 50% (usr + sys) usage indicates near CPU saturation. In other words, the lack of free CPU cycles is getting in the way of greater throughput. I’m also concerned that Figure 2 shows IOWAIT delays (the wai column).

Benchmark passes

The combinations of the following configurations were used to record a total of 63 benchmark runs.

Connections
64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384
Pipelined requests
1, 4, 8, 16, 32, 64, 128
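The 63 runs are simply every pairing of the 9 connection counts with the 7 pipeline depths. A short sketch of that enumeration follows; the type and function names are hypothetical and are not part of Tantrum or the benchmark script.

```go
package main

import "fmt"

// run is one benchmark configuration.
type run struct {
	Connections int
	Pipeline    int
}

// matrix returns every connections/pipeline combination,
// iterating pipeline depths within each connection count.
func matrix(connections, pipelines []int) []run {
	runs := make([]run, 0, len(connections)*len(pipelines))
	for _, c := range connections {
		for _, p := range pipelines {
			runs = append(runs, run{Connections: c, Pipeline: p})
		}
	}
	return runs
}

func main() {
	connections := []int{64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384}
	pipelines := []int{1, 4, 8, 16, 32, 64, 128}

	runs := matrix(connections, pipelines)
	fmt.Println(len(runs)) // 9 * 7 = 63
	for _, r := range runs[:3] {
		fmt.Printf("%d/%d\n", r.Connections, r.Pipeline)
	}
}
```

The connections/pipeline labels in the results (for example 64/1 or 16384/128) follow the same ordering.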

In the production environments I’ve seen, it’s not uncommon to have thousands or tens of thousands of client connections to a single Redis instance. Each benchmark run lasts 10 minutes per service, for a total of 30 minutes of recorded results (approximately 35 minutes with results processing).

Two passes of those 63 benchmark runs were recorded. Each pass is 32 hours long, for a total of 64 hours of recorded benchmarking (74 hours including processing).

  • First pass
    Redis, Redcon and Redeo are freshly restarted processes but warmed up before the benchmark begins.
  • Second pass
    Redis, Redcon and Redeo services have been running for over a week having run over 80 hours of benchmarking. This will tell us how a longer running process performs.

Summary of results

There isn’t much difference between the first and second pass results besides what I would consider regular variance. Both show throughput of 1.2 million SET requests/second in the same runs with comparable latency characteristics.

Overall, Redis has a more predictable latency distribution, with lower outliers in the 99%-100% range than Redcon or Redeo. Redcon and Redeo have higher throughput with lower latency throughout the 0%-99% range but exhibit higher, and sometimes excessive, outliers in the 99%-100% range. I suspect this is due to GC pauses.

As client connections increase, the single-threaded design of Redis starts to show signs of weakness. Redcon performs well at higher connection counts thanks to its ability to use more CPU cores, but it does start to hit limitations that show up in the response latency as the CPU gets saturated. Redeo is unpredictable. Sometimes it has great response latency at throughput comparable to Redis, but sometimes it has bad response latency with the lowest throughput.

In many cases Redcon outperforms Redis in both throughput and latency. At times it can provide 2x the throughput at 50% lower latency throughout most of the percentiles, up until the 99% mark. This isn’t surprising since it isn’t single threaded.

Redis does very well with what it has available in a single thread. Sharding across multiple Redis instances will yield great results at a complexity cost. While Redeo, and more so Redcon, benefit from a multi-threaded design and can sometimes handily outperform Redis, they still fall short of making full use of the CPU resources available. At best they deliver 2x higher throughput or 50% lower latency while having over 20x more CPU resources available. In my opinion there are some CPU bottlenecks in the code that need some work. With a multi-threaded design and pipelining they should be able to achieve network saturation, but they aren’t anywhere near that because they are saturating the CPU.

First pass results

Duration              Connections/Pipelined requests
30m46.145808132s      64/1
31m54.915331448s      64/4
33m18.954527192s      64/8
35m38.670499732s      64/16
35m57.70865066s       64/32
33m42.103557647s      64/64
35m58.991079487s      64/128
31m3.761905733s       128/1
32m27.354653488s      128/4
34m13.655440231s      128/8
35m50.602510961s      128/16
35m21.252212053s      128/32
35m42.410185798s      128/64
35m46.531760901s      128/128
30m45.271462955s      256/1
32m28.119180084s      256/4
33m27.796582097s      256/8
35m56.82829609s       256/16
33m42.667466329s      256/32
35m50.150773509s      256/64
34m3.364519318s       256/128
31m2.605489435s       512/1
32m27.317909709s      512/4
34m2.493518054s       512/8
36m10.265704985s      512/16
35m44.295602286s      512/32
35m59.046439084s      512/64
34m8.447420798s       512/128
31m4.847725062s       1024/1
31m34.371756627s      1024/4
34m27.924214506s      1024/8
36m45.18276848s       1024/16
37m35.898693978s      1024/32
35m51.042629847s      1024/64
36m16.456633136s      1024/128
30m52.305757298s      2048/1
32m25.956440667s      2048/4
33m53.535871481s      2048/8
35m10.484188444s      2048/16
35m23.073785143s      2048/32
35m17.670328868s      2048/64
35m5.750874151s       2048/128
30m52.137323909s      4096/1
31m55.711697359s      4096/4
32m48.11976969s       4096/8
35m17.842507084s      4096/16
36m57.763236335s      4096/32
35m14.315455611s      4096/64
33m44.700852885s      4096/128
31m3.304851101s       8192/1
32m22.559970232s      8192/4
33m3.912727808s       8192/8
34m40.181481378s      8192/16
36m13.040660784s      8192/32
36m41.030094708s      8192/64
34m59.370625586s      8192/128
31m25.524706043s      16384/1
32m23.53607968s       16384/4
33m12.659305177s      16384/8
34m19.785753273s      16384/16
35m59.43791272s       16384/32
35m45.462468929s      16384/64
37m10.095512066s      16384/128

Second pass results

Duration              Connections/Pipelined requests
31m3.322475412s       64/1
32m42.38450521s       64/4
33m25.159167175s      64/8
34m58.887443776s      64/16
36m0.847855503s       64/32
35m18.62984506s       64/64
33m26.403964065s      64/128
30m52.935227999s      128/1
32m39.847335892s      128/4
33m22.998128944s      128/8
35m39.355654102s      128/16
37m6.338372132s       128/32
34m4.376775327s       128/64
35m12.063702525s      128/128
30m43.359264281s      256/1
31m38.929016957s      256/4
33m24.552355671s      256/8
36m51.827534508s      256/16
35m37.664602435s      256/32
33m31.277769482s      256/64
35m58.997174736s      256/128
30m51.599212418s      512/1
32m5.932652855s       512/4
33m24.46653593s       512/8
35m29.630088917s      512/16
37m4.720939476s       512/32
33m33.102139113s      512/64
34m59.409783875s      512/128
30m54.412707407s      1024/1
31m44.762055229s      1024/4
33m51.095726432s      1024/8
36m28.53817141s       1024/16
35m50.70489885s       1024/32
33m1.520941695s       1024/64
34m41.788418729s      1024/128
30m54.684691391s      2048/1
32m18.778053272s      2048/4
33m13.418346007s      2048/8
35m11.909186265s      2048/16
33m48.627801706s      2048/32
35m31.771586183s      2048/64
34m2.297144182s       2048/128
30m53.445688554s      4096/1
32m21.005815217s      4096/4
32m51.908892612s      4096/8
36m0.02284276s        4096/16
35m43.313700241s      4096/32
33m55.440627422s      4096/64
35m48.449855055s      4096/128
30m50.553462312s      8192/1
31m58.140338881s      8192/4
33m22.950393447s      8192/8
35m9.207854768s       8192/16
35m21.141004574s      8192/32
33m39.927007678s      8192/64
34m15.455992783s      8192/128
31m19.726389646s      16384/1
32m39.210737143s      16384/4
32m55.005394811s      16384/8
36m3.536117383s       16384/16
37m14.812024799s      16384/32
34m9.420295346s       16384/64
34m12.475975319s      16384/128


Copyright © 2017 Simon Guindon.
Non-commercial re-use with attribution encouraged; all other rights reserved.