rdkafka-ruby icon indicating copy to clipboard operation
rdkafka-ruby copied to clipboard

Using `partition_key` raises `Local: Timed out` errors

Open Draiken opened this issue 2 years ago • 10 comments

We've recently started using partition_key when producing messages and started seeing a big number of Rdkafka::RdkafkaError: Local: Timed out (timed_out) even with a small number of messages.

IDK if this value needs to be memoized (at least for a short time?) or if anything else needs to be tweaked in order to get it to work reliably.

We attempted to increase all the *.timeout.ms variables to no avail. They would take longer to raise errors, but they'd still raise them.

We were using an older version of librdkafka but even after updating to 0.11.1 the error persisted.

The specific line that was raising the error was this one: https://github.com/appsignal/rdkafka-ruby/blob/76e6221492c9dba8224412cf3f202e91574089e8/lib/rdkafka/metadata.rb#L20

I'm unsure if this is a bug on rdkafka-ruby, librdkafka or perhaps something misconfigured on our side. However we searched the documentation for anything that could lead us to a culprit but couldn't find anything. The error from librdkafka is rather generic: https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md#error-local-time-out

Any help is appreciated!

Draiken avatar Mar 09 '22 18:03 Draiken

Does it happen when creating messages to existing topics or new topics (first message)?

mensfeld avatar Mar 09 '22 18:03 mensfeld

All existing topics. We have auto create set to false.

Draiken avatar Mar 09 '22 18:03 Draiken

To give a bit more context: we are using Heroku Kafka, so we don't have any access to logs on the Kafka side.

Draiken avatar Mar 09 '22 18:03 Draiken

Are you on the basic plan type? standard and extended plans can access kafka logs per their docs.

(Perhaps not helpful, but just clarifying that there are some mechanisms that allow access to logs when using heroku kafka).

geoff2k avatar Mar 09 '22 18:03 geoff2k

We are on standard but the logs were completely useless so far. They just tell us rate of consumption, iops, and things like that... example line of the only logs we get:

source=KAFKA addon=kafka-corrugated-39111 sample#load-avg-1m=0.0025 sample#load-avg-5m=0.005 sample#load-avg-15m=0 sample#read-iops=0 sample#write-iops=1.5339 sample#memory-total=16386156kB sample#memory-free=6966620kB sample#memory-cached=4228484kB sample#bytes-in-per-second=75.89 sample#bytes-out-per-second=5.74

Draiken avatar Mar 09 '22 19:03 Draiken

It just occurred to me that Heroku Kafka might be having an errror with the metadata request, but if we don't get a response within the 250ms timeout, we wouldn't see the real error, only the timeout.

According to their ACLs we should have Describe permission to Topic so it should not be a permission issue.

Draiken avatar Mar 09 '22 19:03 Draiken

Would you mind also telling me:

  1. rate at which you send data
  2. exact config you have for production
  3. average and max size of messages
  4. does it happen without a partition key usage?

mensfeld avatar Mar 09 '22 19:03 mensfeld

Of course!

  • Our producers were generating around 60 messages per second peak. But later we could reproduce the error with much lower rates
  • This one is a bit tricky since we tried a bunch, and Racecar wraps most of it. We saw this with all the default *.timeout.ms. The only values we changed from defaults were:
{
  "fetch.max.bytes" => 2000000,
  "auto.commit.interval.ms" => 2000,
}
  • This varies a ton (and we had issues with all message sizes) Max size I saw was round ~7kb Average among all topics is ~600b
  • No, after tracking down the bug we found the issue only happened after we started using partition_key since it triggered that Metadata call, which then resulted in these timeouts

Draiken avatar Mar 10 '22 13:03 Draiken

This sounds like it's probably not caused by the Ruby layer, but by some interaction between librdkafka and kafka. @leonmaia mentioned that there was a timeout in listing groups in #185. Could that maybe be related?

thijsc avatar Mar 10 '22 16:03 thijsc

FWIW: https://github.com/edenhill/librdkafka/issues/3705, there's a know issue in librdkafka that addresses some time out issues. The fix is only released in a RC1 release tho.

leonmaia avatar Mar 14 '22 11:03 leonmaia

This should be fixed, we are using 1.9.2 now. Let me know if there are still issues.

thijsc avatar Nov 08 '22 16:11 thijsc