redis-rb icon indicating copy to clipboard operation
redis-rb copied to clipboard

Redis::Distributed's shard distribution doesn't seem very even?

Open jdelStrother opened this issue 2 years ago • 8 comments

Hi there - we've got a couple of redis nodes set up as a cache for Rail's RedisCacheStore. I noticed that one of these nodes was receiving much more traffic than the other.

It seems like the default sharding doesn't balance between two nodes very well. With something like this -

count = Hash.new(0)
r = Redis::Distributed.new([{ url: 'redis://1.1.1.1:6379/0' }, { url: 'redis://2.2.2.2:6379/0'}], driver: :hiredis)

100000.times {
  key = SecureRandom.hex
  node = r.node_for(key)
  count[node.instance_variable_get("@options")[:url]] += 1
}
count
=> {"redis://2.2.2.2:6379/0"=>62322, "redis://1.1.1.1:6379/0"=>37678}

so here the 2.2.2.2 node was chosen 40% more often that the 1.1.1.1. The key (SecureRandom.hex) here is obviously pretty artificial, but the ratio seems to match the extra traffic I'm seeing hitting one of my nodes with Rails' actionview keys.

Is this expected behaviour, or is something weird going on?

jdelStrother avatar Mar 15 '22 23:03 jdelStrother

@jdelStrother I was able to reproduce.

It all surely depends on the urls and cache keys, so sometimes you can get ~ even distribution, sometimes 60/40 or 70/30. For 2 nodes, no matter which urls I used, I almost always got the distribution similar from your example. For 3 nodes a little better, but sometimes also a big skewing.

One approach to improve the distribution, is to increase the number of points for each server. Something like:

ring = Redis::HashRing.new([], 1024)
r = Redis::Distributed.new([{ url: 'redis://1.1.1.1:6379/0' }, { url: 'redis://2.2.2.2:6379/0'}], ring: ring)

I investigated how dalli implements the distributed client. The approach is basically the same (160 points, crc32, etc), except they use a little different hashing for server nodes: https://github.com/petergoldstein/dalli/blob/4f6ffac1e1b19222a42619669d32f42701277941/lib/dalli/ring.rb#L133-L134

I tried to apply this change to Redis::HashRing and run your example. And I saw almost even distribution all the time! 🤔 I ran It for lots of combinations and different urls and different numbers of them. I am not strong in cryptography, so I do not know why is that.

If this is a good approach, the hashing algorithm probably can be changed. But this will also lead to a lot of cache misses when deployed for the first time.

@byroot Do you have any thoughts on this?

fatkodima avatar Aug 12 '22 20:08 fatkodima

If this is a good approach, the hashing algorithm probably can be changed. But this will also lead to a lot of cache misses when deployed for the first time.

Since I'm working on the next major, it might be a good time to make such a change.

Redis::Distributed is a very old feature, long before I took over as maintainer, so I don't know why this algo in particular was chosen, whether it's arbitrary or if it follow some kind of spec other redis client follow too? Would be good to check other langauges clients to see what they do

byroot avatar Aug 13 '22 12:08 byroot

So from the list https://redis.io/docs/clients/ I found 2 implementations of consistent hashing:

php: https://github.com/predis/predis/blob/main/src/Cluster/Distributor/HashRing.php - for hashing server nodes uses custom user-defined functions or .to_s, if none. For keys - crc32.

go: https://github.com/go-redis/redis/blob/master/ring.go - for hashing server nodes and keys uses xxhash (http://cyan4973.github.io/xxHash/).

I searched through the history of this gem and found that originally hashing was done by md5, but then changed to crc32 in https://github.com/redis/redis-rb/commit/5cd3a39e525b32923ea1315dfcf697dcc7d586ec. Maybe because crc32 is faster to compute? That change was done 13 years ago, but maybe you remember why, @mperham? 😄

fatkodima avatar Aug 13 '22 15:08 fatkodima

Ok, so there's no standard to adhere to. So we can probably change it in a major without any issue.

byroot avatar Aug 13 '22 15:08 byroot

I believe that’s code borrowed from Dalli as I never worked on redis-rb back then that I can recall.

I’m pretty sure that I chose crc32 because it was easy to access (being in stdlib) and very fast. You are right that it is not particularly well suited for this use case. I’d recommend surveying a few more clients and choosing a real random hashing scheme. md5 should be fast and a reasonable choice.

mperham avatar Aug 13 '22 16:08 mperham

@byroot So do you think we should change hashing only for server keys (so sha1/md5 for servers and crc32 for cache keys, like in dalli) or for both servers keys and cache keys?

fatkodima avatar Aug 13 '22 16:08 fatkodima

That’s your call but I’m pretty sure crc32 isn’t well suited for either use case.

mperham avatar Aug 13 '22 16:08 mperham

@fatkodima I'd go with MD5 for everything as it's likely the best perf / distribution ration available in the stdlib. We should probably make that configurable though. e.g. accept a callable, or simply call on common interface like #digest or #hexdigest.

byroot avatar Aug 13 '22 16:08 byroot