rdkafka-ruby
Using `partition_key` raises `Local: Timed out` errors
We've recently started using `partition_key` when producing messages, and started seeing a large number of `Rdkafka::RdkafkaError: Local: Timed out (timed_out)` errors, even with a small number of messages.
I don't know whether this value needs to be memoized (at least for a short time?) or whether something else needs to be tweaked to get it to work reliably.
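For reference, this is roughly the shape of the producing code that triggers it; the broker address, topic, and key below are placeholders rather than our real setup:

```ruby
require "rdkafka"

# Placeholder broker address; our real config differs.
producer = Rdkafka::Config.new("bootstrap.servers" => "localhost:9092").producer

begin
  # Passing partition_key makes the client fetch topic metadata to pick a
  # partition, and that metadata request is what times out for us.
  producer.produce(
    topic:         "events",        # placeholder topic
    payload:       "some payload",
    partition_key: "customer-123"   # placeholder key
  ).wait
rescue Rdkafka::RdkafkaError => e
  # This is where we see "Local: Timed out (timed_out)".
  puts e.message
end
```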
We attempted to increase all the `*.timeout.ms` settings, to no avail: the errors took longer to appear, but they were still raised.
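For illustration, the overrides were along these lines (the values here are made up, not our exact settings); they only delayed the error rather than preventing it:

```ruby
# Illustrative timeout overrides; these made the error take longer to appear
# but did not make it go away.
config = Rdkafka::Config.new(
  "bootstrap.servers"  => "localhost:9092",  # placeholder
  "socket.timeout.ms"  => 60_000,
  "request.timeout.ms" => 60_000,
  "message.timeout.ms" => 300_000
)
producer = config.producer
```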
We were using an older version of librdkafka, but even after updating to 0.11.1 the error persisted.
The specific line that was raising the error was this one: https://github.com/appsignal/rdkafka-ruby/blob/76e6221492c9dba8224412cf3f202e91574089e8/lib/rdkafka/metadata.rb#L20
I'm unsure if this is a bug in rdkafka-ruby, in librdkafka, or perhaps something misconfigured on our side. We searched the documentation for anything that could point us to a culprit, but couldn't find anything. The error from librdkafka is rather generic: https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md#error-local-time-out
Any help is appreciated!
Does it happen when producing messages to existing topics or to new topics (first message)?
All existing topics. We have auto create set to false.
To give a bit more context: we are using Heroku Kafka, so we don't have any access to logs on the Kafka side.
Are you on the `basic` plan type? The `standard` and `extended` plans can access Kafka logs per their docs. (Perhaps not helpful, but just clarifying that there are some mechanisms that allow access to logs when using Heroku Kafka.)
We are on `standard`, but the logs have been completely useless so far. They just tell us rate of consumption, IOPS, and things like that. Here's an example line of the only logs we get:

```
source=KAFKA addon=kafka-corrugated-39111 sample#load-avg-1m=0.0025 sample#load-avg-5m=0.005 sample#load-avg-15m=0 sample#read-iops=0 sample#write-iops=1.5339 sample#memory-total=16386156kB sample#memory-free=6966620kB sample#memory-cached=4228484kB sample#bytes-in-per-second=75.89 sample#bytes-out-per-second=5.74
```
It just occurred to me that Heroku Kafka might be returning an error for the metadata request, but since we don't get a response within the 250ms timeout, we never see the real error, only the timeout.
According to their ACLs we should have `Describe` permission on topics, so it should not be a permission issue.
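One way to narrow this down further might be to trigger just the metadata lookup in isolation, assuming the producer's `partition_count` helper is available in the installed rdkafka-ruby version (broker and topic below are placeholders):

```ruby
require "rdkafka"

producer = Rdkafka::Config.new("bootstrap.servers" => "localhost:9092").producer

begin
  # partition_count performs only the topic metadata request that partition_key
  # relies on, so a timeout here points at the metadata round trip itself.
  count = producer.partition_count("events")  # placeholder topic
  puts "partitions: #{count}"
rescue Rdkafka::RdkafkaError => e
  puts "metadata lookup failed: #{e.message}"
end
```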
Would you mind also telling me:
- rate at which you send data
- exact config you have for production
- average and max size of messages
- does it happen without using a partition key?
Of course!
- Our producers were generating around 60 messages per second at peak, but later we could reproduce the error at much lower rates
- This one is a bit tricky, since we tried a bunch of things and Racecar wraps most of it. We saw this with all the `*.timeout.ms` values at their defaults. The only values we changed from the defaults were:

  ```ruby
  {
    "fetch.max.bytes" => 2000000,
    "auto.commit.interval.ms" => 2000,
  }
  ```
- This varies a ton (and we had issues with all message sizes). The max size I saw was around ~7 KB; the average across all topics is ~600 B
- No. After tracking down the bug, we found the issue only happened after we started using `partition_key`, since it triggered that `Metadata` call, which then resulted in these timeouts (a possible workaround is sketched below)
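In case it's useful to anyone hitting the same thing, one possible mitigation, purely as a sketch (the helper, TTL, and crc32-based hashing below are illustrative, and it assumes `Producer#partition_count` exists in your rdkafka version), is to cache the partition count per topic and pass an explicit `partition:` instead of `partition_key:`, so the metadata request doesn't happen for every message:

```ruby
require "rdkafka"
require "zlib"

# Sketch of a workaround: cache the per-topic partition count for a short TTL and
# choose the partition ourselves, instead of letting partition_key trigger a
# metadata request on each produce call.
class CachedPartitioner
  TTL = 30 # seconds, illustrative

  def initialize(producer)
    @producer = producer
    @cache    = {} # topic => [partition_count, fetched_at]
  end

  def partition_for(topic, partition_key)
    count, fetched_at = @cache[topic]
    if count.nil? || Time.now - fetched_at > TTL
      count = @producer.partition_count(topic)
      @cache[topic] = [count, Time.now]
    end
    # crc32 is just one deterministic way to map a key onto a partition.
    Zlib.crc32(partition_key) % count
  end
end

producer    = Rdkafka::Config.new("bootstrap.servers" => "localhost:9092").producer
partitioner = CachedPartitioner.new(producer)

producer.produce(
  topic:     "events",  # placeholder topic
  payload:   "some payload",
  partition: partitioner.partition_for("events", "customer-123")
).wait
```

Note this changes which partitioner decides placement, so existing key-to-partition mappings could shift; it's only meant to show where the metadata call might be avoided.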
This sounds like it's probably not caused by the Ruby layer, but by some interaction between librdkafka and kafka. @leonmaia mentioned that there was a timeout in listing groups in #185. Could that maybe be related?
FWIW: there's a known issue in librdkafka that addresses some timeout problems: https://github.com/edenhill/librdkafka/issues/3705. The fix is only available in an RC1 release, though.
This should be fixed; we are using 1.9.2 now. Let me know if there are still issues.