Weird behavior under "high traffic/concurrency"
Hi all 👋,
A bit of context first: I'm a Ruby dev (so a Crystal noob here) and I decided to use Kemal to implement our ping service (used by our mobile apps to display a maintenance view). It's called by the mobile apps every 30s. Our first version was not using Redis at all and our Kemal server was serving up to 1000 req/s flawlessly.
Lately, we decided to use Redis to control our Ping service. Our endpoint is super simple:
require "kemal"
require "redis"

redis = Redis::PooledClient.new(url: ENV.fetch("REDIS_URL") { nil })

get "/ping" do |env|
  env.response.content_type = "application/json"
  redis.get("maintenance_started_at") ? hibernating_state : state
end

Kemal.run
And things started to get super weird, randomly: we noticed that the Kemal server was not responding at all and we got errors such as "accept: Too many open files (Socket::Error)" and "No free connection".
It took me a while to reproduce the issue locally.
If the number of concurrent requests is not too high (but higher than the pool size), the Redis pooler displays the "No free connection" error (but not always) and, at some point, the Kemal server freezes.
When the number of concurrent requests is way too high, the Kemal app quickly freezes completely, only displaying the logs of the first 6 requests (related to the pool size).
My guess (not confirmed yet) is that the Kemal app keeps (silently) accepting requests and, at some point, the operating system raises the "Too many open files" exception.
Just to make things clear here, if I disable the call to Redis, I don't have any trouble running my benchmark.
So for now, we found the right spot regarding the pool size and the traffic. Obviously, when we receive an unexpected peak of traffic, our ping service will crash for sure... And it would be a shame to re-implement it in pure Ruby (Sinatra) since, under normal circumstances, Kemal + Redis is super fast (average: 150µs on my laptop).
Thanks for your help and your awesome work !!!!
Probably pool_size is too small, try a bigger one: https://github.com/stefanwille/crystal-redis/blob/master/src/redis/pooled_client.cr#L29
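For reference, a minimal sketch of that fix. The pool_size and pool_timeout parameters are the ones exposed by Redis::PooledClient; the value 50 is made up and would need to be sized to the expected concurrency:

redis = Redis::PooledClient.new(
  url: ENV.fetch("REDIS_URL") { nil },
  pool_size: 50,    # assumption: larger than the expected number of concurrent requests
  pool_timeout: 5.0 # seconds to wait for a free connection before giving up
)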
Thanks @kostya! That's absolutely the right solution (and this is what we did).
But even if the pool size was underestimated, this shouldn't freeze the Kemal server; at the very least, responding with a 500 would be fine. For some reason, the Redis pool seems to have a hold on the Kemal server (again, I'm no expert).
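My (unverified) reading of the freeze is that the handler's fiber blocks forever while waiting for a connection instead of timing out. A minimal sketch of that mechanism, using only the standard library and not the pool shard's actual code:

# A fiber that waits on an empty Channel with no timeout never resumes,
# so the request that owns it never completes and its socket stays open.
channel = Channel(Nil).new

spawn do
  channel.receive # blocks this fiber forever; nothing is ever sent
  puts "never reached"
end

sleep 1.second
puts "main fiber still alive, spawned fiber still blocked"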
I narrowed down my local tests so that I can reproduce every time the issue.
Context: same code as before, with a pool size of 5 and a timeout of 5s. I run my Kemal server with KEMAL_ENV=production.
Test 1
wrk -c 5 -d 2 -H "user-agent: benchmark" http://localhost:3000/ping
Note: the -c 5 option is important, it's exactly my pool size.
👉 no problem at all. The performance is optimal.
Test 2
wrk -c 6 -d 2 -H "user-agent: benchmark" http://localhost:3000/ping
Note: the -c 6 is my pool size + 1.
⚠️ not all the requests are processed. No error in the logs. If I re-run the wrk command without restarting my Kemal server, it doesn't process any request at all (no logs, no errors, nothing).
The expected behavior would be to get a "No free connections" exception and therefore a 500 response from the API.
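For what it's worth, that expected behavior can be approximated in the handler itself. A minimal sketch, assuming the pool raises Redis::PoolTimeoutError (the error type used later in this thread) when no connection is free:

get "/ping" do |env|
  env.response.content_type = "application/json"
  begin
    redis.get("maintenance_started_at") ? hibernating_state : state
  rescue Redis::PoolTimeoutError
    # Fail fast instead of hanging: tell the client we are temporarily overloaded.
    env.response.status_code = 503
    %({"error": "redis pool exhausted"})
  end
end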
Also, which versions of crystal-redis and pool are you using? There was a recent update in pool related to concurrency that may have introduced some bugs. Try the previous versions:
dependencies:
  redis:
    github: stefanwille/crystal-redis
    version: "2.8.3"
  pool:
    github: ysbaddaden/pool
    version: "0.2.4"
@kostya awesome!!! Downgrading fixed my issues!
And now the behavior under "high traffic" is consistent, in the sense that the time taken by the app to respond just increases.
Thank you so much @kostya!
@ysbaddaden ping, probably bug in pool 0.3.0
Oh no. The main difference is that it now uses Channel and select where the old version used an IO.pipe, and... there is now a mutex and it's unchecked :thinking:
~~@did could you report the problem to https://github.com/ysbaddaden/pool? It would be invaluable if you could provide a reduced reproducible test case :bow:~~ I may have found where the issue is (in the mutex).
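For context on the "unchecked" part, a minimal illustration of Crystal's Mutex protection modes (this is stdlib behavior and an assumption about where the bug might be, not the pool shard's actual code):

checked   = Mutex.new             # default protection: detects misuse and raises
unchecked = Mutex.new(:unchecked) # skips those safety checks

checked.lock
# checked.lock    # would raise: a checked mutex detects a recursive lock
checked.unlock

unchecked.lock
# unchecked.lock  # would silently deadlock the fiber instead of raising
unchecked.unlock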
Using this shard I got affected by https://github.com/ysbaddaden/pool/issues/2 and https://github.com/ysbaddaden/pool/issues/10 in the same project, so instead of using Redis::PooledClient I wrote an API-compatible replacement, RedisPooledClient 😅. I'm still testing it, but so far it removes bad servers from the pool on Redis errors and works like a pool is expected to work.
If there's interest I can create a MR patching Redis::PooledClient; code that deals with concurrency always has its gotchas, so more eyes to review it are welcome.
P.S.: I didn't add the subscribe methods because I don't use them in my project, but since they rely on with_pool_connection they should work when copy-pasted here.
class RedisPooledClient
  @pool : Channel(Redis)
  @pool_timeout : Time::Span
  @free_slots_mutex = Mutex.new
  @free_slots : Int32
  @factory : Proc(Redis)

  def initialize(*args, pool_size = 5, pool_timeout = 5.0, **args2)
    @pool = Channel(Redis).new(pool_size)
    @free_slots = pool_size
    @pool_timeout = pool_timeout.seconds
    @factory = ->{ Redis.new(*args, **args2) }
  end

  macro method_missing(call)
    # Delegates all Redis commands to a `Redis` instance from the connection pool.
    with_pool_connection(&.{{call}})
  end

  private def with_pool_connection
    redis = @free_slots_mutex.synchronize do
      if @free_slots.zero?
        fetch_from_pool
      else
        @free_slots -= 1
        @factory.call
      end
    end

    yield(redis).tap do
      @pool.send(redis)
    end
  rescue ex : Redis::PoolTimeoutError
    # On PoolTimeoutError no redis was fetched, so @free_slots isn't increased.
    raise ex
  rescue ex : Redis::Error
    @free_slots_mutex.synchronize { @free_slots += 1 }
    raise ex
  end

  private def fetch_from_pool : Redis
    select
    when redis = @pool.receive
      redis
    when timeout(@pool_timeout)
      raise Redis::PoolTimeoutError.new("No free connection after timeout of #{@pool_timeout}s")
    end
  end
end
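To make the delegation explicit (my reading of the macro above, not extra behavior): any method not defined on the pooled client is expanded at compile time into a checkout/checkin around a plain Redis call:

pooled = RedisPooledClient.new(url: ENV.fetch("REDIS_URL") { nil })

# This call is not defined on RedisPooledClient, so the method_missing macro
# expands it (inside the class) into roughly:
#   with_pool_connection(&.get("maintenance_started_at"))
# i.e. take a Redis connection from the pool, run `get`, and put it back.
pooled.get("maintenance_started_at")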
I benchmarked this code on an AMD Ryzen 9 5900X 12-Core Processor (without -Dpreview_mt):
Results:
$ echo "GET http://localhost:3000/ping" | vegeta attack -rate 10000 -duration=5s | tee results.bin | vegeta report
Requests [total, rate, throughput] 50000, 9998.67, 9989.93
Duration [total, attack, wait] 5.001s, 5.001s, 173.735µs
Latencies [min, mean, 50, 90, 95, 99, max] 80.821µs, 3.16ms, 115.616µs, 671.283µs, 22.954ms, 75.35ms, 211.4ms
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 99.92%
Status Codes [code:count] 0:42 200:49958
Error Set:
Get "http://localhost:3000/ping": dial tcp 0.0.0.0:0->[::1]:3000: connect: connection refused
Code:
require "kemal"
require "redis"
class RedisPooledClient
@pool : Channel(Redis)
@pool_timeout : Time::Span
@free_slots_mutex = Mutex.new
@free_slots : Int32
@factory : Proc(Redis)
def initialize(*args, pool_size = 5, pool_timeout = 5.0, **args2)
@pool = Channel(Redis).new(pool_size)
@free_slots = pool_size
@pool_timeout = pool_timeout.seconds
@factory = ->{ Redis.new(*args, **args2) }
end
macro method_missing(call)
# Delegates all Redis commands to a `Redis` instance from the connection pool.
with_pool_connection(&.{{call}})
end
private def with_pool_connection
redis = @free_slots_mutex.synchronize do
if @free_slots.zero?
fetch_from_pool
else
@free_slots -= 1
@factory.call
end
end
yield(redis).tap do
@pool.send(redis)
end
rescue ex : Redis::PoolTimeoutError
# on PoolTimeoutError no redis was fetch, so @free_slots isn't increased.
raise ex
rescue ex : Redis::Error
@free_slots_mutex.synchronize { @free_slots += 1 }
raise ex
end
private def fetch_from_pool : Redis
select
when redis = @pool.receive
redis
when timeout(@pool_timeout)
raise Redis::PoolTimeoutError.new("No free connection after timeout of #{@pool_timeout}s")
end
end
end
redis = RedisPooledClient.new(url: ENV.fetch("REDIS_URL") { nil })
get "/ping" do |env|
env.response.content_type = "application/json"
redis.get("maintenance_started_at")
end
Kemal.run
@hugopl you should not be affected by https://github.com/ysbaddaden/pool/issues/10, because pool was reverted to the stable version 0.2.4 (just don't use 0.3.0). Also, there should not be a problem with https://github.com/ysbaddaden/pool/issues/2, because the redis client just reconnects on a dead connection. So I don't know what the problem is?