
Weird behavior under "high traffic/concurrency"

did opened this issue 1 year ago • 10 comments

Hi all 👋,

A bit of context first: I'm a Ruby dev (so a Crystal noob here), and I decided to use Kemal to implement our ping service (used by our mobile app to display a maintenance view). It's called by the mobile apps every 30s. Our first version did not use Redis at all, and our Kemal server was serving up to 1000 req/s flawlessly.

Lately, we decided to use Redis to control our Ping service. Our endpoint is super simple:

redis = Redis::PooledClient.new(url: ENV.fetch("REDIS_URL") { nil })

get "/ping" do |env|
  env.response.content_type = "application/json"
  redis.get("maintenance_started_at") ? hibernating_state : state
end

Kemal.run

And things started to get super weird, seemingly at random. We noticed that the Kemal server was not responding at all, and we got errors such as accept: Too many open files (Socket::Error) and No free connection....

It took me a while to reproduce the issue locally. If the number of concurrent requests is not too high (but higher than the pool size), the Redis pooler displays the No free connection error (but not always), and at some point the Kemal server freezes. When the number of concurrent requests is way too high, the Kemal app quickly freezes completely, only displaying the logs of the first 6 requests (related to the pool size). My guess (not confirmed yet) is that the Kemal app keeps (silently) accepting requests until, at some point, the operating system raises the Too many open files error.
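To make the suspected failure mode concrete, here is a minimal Ruby sketch (hypothetical names, a Queue standing in for the connection pool): once the pool is drained, every further checkout blocks forever, which from the outside looks exactly like a frozen server, while the accept loop keeps opening sockets until the OS file-descriptor limit is hit.

```ruby
# A pool of 2 "connections" modeled as a FIFO queue.
pool = Queue.new
2.times { |i| pool << "conn-#{i}" }

a = pool.pop   # first two checkouts succeed immediately
b = pool.pop

# A third checkout: a plain `pop` would block this thread forever (the
# frozen-server symptom). The non-blocking form raises instead.
exhausted = false
begin
  pool.pop(true)
rescue ThreadError
  exhausted = true
end
```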

Just to make things clear: if I disable the call to Redis, I don't have any trouble running my benchmark.

So for now, we found the right balance between the pool size and the traffic. Obviously, when we receive an unexpected traffic peak, our ping service will crash for sure... And it would be a shame to re-implement it in pure Ruby (Sinatra), since under normal circumstances Kemal + Redis is super fast (average: 150µs on my laptop).

Thanks for your help and your awesome work!!!!

did avatar Mar 24 '23 11:03 did

probably pool_size is too small, try a bigger value: https://github.com/stefanwille/crystal-redis/blob/master/src/redis/pooled_client.cr#L29
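To put rough numbers on "too small", a back-of-the-envelope Ruby sketch using Little's law: connections needed ≈ request rate × time each request holds a connection. The rate and latency come from this thread; the safety factor is an assumption for headroom.

```ruby
rate_per_sec  = 1000.0    # ~1000 req/s at peak (from the thread)
hold_time_sec = 150e-6    # ~150µs per request (from the thread)
safety_factor = 10        # assumed headroom for latency spikes

needed = (rate_per_sec * hold_time_sec * safety_factor).ceil
# Steady-state demand is tiny; it is latency spikes (hold time growing by
# orders of magnitude) that drain a small pool.
```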

kostya avatar Mar 24 '23 11:03 kostya

thanks @kostya! That's absolutely the right solution (and this is what we did).

But even if the pool size was underestimated, this shouldn't freeze the Kemal server. At the very least, responding with a 500 would be fine. For some reason, the Redis pool seems to have a hold on the Kemal server (again, I'm no expert).

did avatar Mar 24 '23 11:03 did

I narrowed down my local tests so that I can reproduce the issue every time.

Context: same code as before, with a pool size of 5 and a timeout of 5s. I run my Kemal server with KEMAL_ENV=production.

Test 1

wrk -c 5 -d 2 -H "user-agent: benchmark" http://localhost:3000/ping

Note: the -c 5 option is important; it's exactly my pool size.

👉 no problem at all. Performance is optimal.

Test 2

wrk -c 6 -d 2 -H "user-agent: benchmark" http://localhost:3000/ping

Note: -c 6 is my pool size + 1.

⚠️ not all the requests are processed. No error in the logs. If I re-run the wrk command without restarting my Kemal server, the Kemal server doesn't process any requests (no logs, no errors, nothing).

The expected behavior would be a "No free connections" exception, and thus a 500 response from the API.
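That expected behavior can be sketched in a few lines of Ruby (hypothetical names; a Queue standing in for the pool): bound the wait for a free connection, and let the web layer turn the timeout into a 500 instead of hanging forever.

```ruby
require "timeout"

# Checkout that waits at most `pool_timeout` seconds for a free connection.
def checkout(pool, pool_timeout)
  Timeout.timeout(pool_timeout) { pool.pop }  # pop blocks while pool is empty
rescue Timeout::Error
  raise "No free connection after timeout of #{pool_timeout}s"
end

empty_pool = Queue.new
status =
  begin
    checkout(empty_pool, 0.05)
    200
  rescue RuntimeError
    500  # what the API should answer when the pool is exhausted
  end
```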

did avatar Mar 24 '23 12:03 did

Also, what versions of crystal-redis and pool are you using? There was a recent update in pool related to concurrency; it may have added some bugs. Try the previous versions:

dependencies:
  redis:
    github: stefanwille/crystal-redis
    version: "2.8.3"
  pool:
    github: ysbaddaden/pool
    version: "0.2.4"

kostya avatar Mar 24 '23 12:03 kostya

@kostya awesome!!! Downgrading fixed my issues!

And now the behavior under "high traffic" is consistent, in the sense that the time taken by the app to respond just increases.

Thank you so much @kostya!

did avatar Mar 24 '23 12:03 did

@ysbaddaden ping, probably bug in pool 0.3.0

kostya avatar Mar 24 '23 12:03 kostya

Oh no. The main difference is that it now uses Channel and select, where the old version used an IO.pipe, and... there is now a mutex, and it's unchecked :thinking:

~~@did could you report the problem to https://github.com/ysbaddaden/pool, and it would be invaluable if you could provide a reduced reproducible test case :bow:~~ I may have found where the issue is (in the mutex).
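For readers unfamiliar with the checked/unchecked distinction: a checked mutex detects re-locking by its owner and raises, while an unchecked one silently deadlocks, which is the kind of freeze seen in this issue. Ruby's Mutex happens to be checked, so the hazard can be demonstrated safely (a sketch, not the pool shard's actual code):

```ruby
m = Mutex.new
error = nil
m.synchronize do
  begin
    m.lock          # recursive lock attempt by the owning thread
  rescue ThreadError => e
    error = e.class # checked mutex: raises instead of deadlocking
  end
end
```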

ysbaddaden avatar Mar 24 '23 14:03 ysbaddaden

Using this shard I was affected by https://github.com/ysbaddaden/pool/issues/2 and https://github.com/ysbaddaden/pool/issues/10 in the same project, so instead of using Redis::PooledClient I wrote an API-compatible replacement, RedisPooledClient 😅. I'm still testing it, but so far it removes bad connections from the pool on Redis errors and works like a pool is expected to work.

If there's interest I can create a MR patching Redis::PooledClient; code that deals with concurrency always has its gotchas, so more eyes to review it are welcome.

P.S.: I didn't add the subscribe methods because I don't use them in my project, but since they rely on with_pool_connection they should work when copy-pasted here.

class RedisPooledClient
  @pool : Channel(Redis)
  @pool_timeout : Time::Span
  @free_slots_mutex = Mutex.new
  @free_slots : Int32
  @factory : Proc(Redis)

  def initialize(*args, pool_size = 5, pool_timeout = 5.0, **args2)
    @pool = Channel(Redis).new(pool_size)
    @free_slots = pool_size
    @pool_timeout = pool_timeout.seconds
    @factory = ->{ Redis.new(*args, **args2) }
  end

  macro method_missing(call)
    # Delegates all Redis commands to a `Redis` instance from the connection pool.
    with_pool_connection(&.{{call}})
  end

  private def with_pool_connection
    redis = @free_slots_mutex.synchronize do
      if @free_slots.zero?
        fetch_from_pool
      else
        @free_slots -= 1
        @factory.call
      end
    end
    yield(redis).tap do
      @pool.send(redis)
    end
  rescue ex : Redis::PoolTimeoutError
    # on PoolTimeoutError no redis was fetched, so @free_slots isn't increased.
    raise ex
  rescue ex : Redis::Error
    @free_slots_mutex.synchronize { @free_slots += 1 }
    raise ex
  end

  private def fetch_from_pool : Redis
    select
    when redis = @pool.receive
      redis
    when timeout(@pool_timeout)
      raise Redis::PoolTimeoutError.new("No free connection after timeout of #{@pool_timeout.total_seconds}s")
    end
  end
end
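The class above translates roughly into this Ruby analog (hypothetical, simplified): connections are built lazily up to pool_size, and once all slots are used, checkouts wait for one to be returned. One deliberate difference: the blocking pop happens outside the lock so other threads can still check connections in and out; error handling (dropping a connection on a Redis error, as the Crystal version does) is omitted for brevity.

```ruby
class LazyPool
  def initialize(pool_size, &factory)
    @queue      = Queue.new   # returned, ready-to-reuse connections
    @free_slots = pool_size   # how many connections may still be created
    @mutex      = Mutex.new
    @factory    = factory
  end

  def with_connection
    conn = @mutex.synchronize do
      if @free_slots.zero?
        nil                    # pool fully built: wait on the queue below
      else
        @free_slots -= 1
        @factory.call          # lazily build a new connection
      end
    end
    conn ||= @queue.pop        # block (outside the lock) until one is free
    yield conn
  ensure
    @queue << conn if conn     # always return the connection
  end
end

pool = LazyPool.new(2) { Object.new }
seen = []
3.times { pool.with_connection { |c| seen << c } }
# the third call reuses the first connection instead of building a new one
```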

hugopl avatar Mar 08 '24 14:03 hugopl

I benchmarked this code on an AMD Ryzen 9 5900X 12-Core Processor (without -Dpreview_mt):

Results:

$ echo "GET http://localhost:3000/ping" | vegeta attack -rate 10000 -duration=5s | tee results.bin | vegeta report
Requests      [total, rate, throughput]         50000, 9998.67, 9989.93
Duration      [total, attack, wait]             5.001s, 5.001s, 173.735µs
Latencies     [min, mean, 50, 90, 95, 99, max]  80.821µs, 3.16ms, 115.616µs, 671.283µs, 22.954ms, 75.35ms, 211.4ms
Bytes In      [total, mean]                     0, 0.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           99.92%
Status Codes  [code:count]                      0:42  200:49958  
Error Set:
Get "http://localhost:3000/ping": dial tcp 0.0.0.0:0->[::1]:3000: connect: connection refused

Code:

require "kemal"
require "redis"

class RedisPooledClient
  @pool : Channel(Redis)
  @pool_timeout : Time::Span
  @free_slots_mutex = Mutex.new
  @free_slots : Int32
  @factory : Proc(Redis)

  def initialize(*args, pool_size = 5, pool_timeout = 5.0, **args2)
    @pool = Channel(Redis).new(pool_size)
    @free_slots = pool_size
    @pool_timeout = pool_timeout.seconds
    @factory = ->{ Redis.new(*args, **args2) }
  end

  macro method_missing(call)
    # Delegates all Redis commands to a `Redis` instance from the connection pool.
    with_pool_connection(&.{{call}})
  end

  private def with_pool_connection
    redis = @free_slots_mutex.synchronize do
      if @free_slots.zero?
        fetch_from_pool
      else
        @free_slots -= 1
        @factory.call
      end
    end
    yield(redis).tap do
      @pool.send(redis)
    end
  rescue ex : Redis::PoolTimeoutError
    # on PoolTimeoutError no redis was fetched, so @free_slots isn't increased.
    raise ex
  rescue ex : Redis::Error
    @free_slots_mutex.synchronize { @free_slots += 1 }
    raise ex
  end

  private def fetch_from_pool : Redis
    select
    when redis = @pool.receive
      redis
    when timeout(@pool_timeout)
      raise Redis::PoolTimeoutError.new("No free connection after timeout of #{@pool_timeout.total_seconds}s")
    end
  end
end

redis = RedisPooledClient.new(url: ENV.fetch("REDIS_URL") { nil })

get "/ping" do |env|
  env.response.content_type = "application/json"
  redis.get("maintenance_started_at")
end

Kemal.run

hugopl avatar Mar 20 '24 03:03 hugopl

@hugopl, you should not be affected by https://github.com/ysbaddaden/pool/issues/10, because pool was reverted to the stable version 0.2.4 (just don't use 0.3.0). There should also be no problem with https://github.com/ysbaddaden/pool/issues/2, because the redis client just reconnects on a dead connection. So I don't know what the problem is.
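The reconnect-on-dead-connection behavior described here can be sketched in Ruby as checkout-time validation (hypothetical names; FakeConn stands in for a real client): probe a pooled connection before handing it out, and replace it with a fresh one if it is dead.

```ruby
class FakeConn
  def initialize(alive)
    @alive = alive
  end

  def ping
    raise IOError, "connection reset" unless @alive
    "PONG"
  end
end

def checkout(pool)
  conn = pool.pop
  conn.ping                  # probe before handing the connection out
  conn
rescue IOError
  FakeConn.new(true)         # dead: build a fresh connection instead
end

pool = Queue.new
pool << FakeConn.new(false)  # a stale connection left in the pool
conn = checkout(pool)
```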

kostya avatar Mar 20 '24 12:03 kostya