
Change Kafka collection approach to represent storage backpressure

Open codefromthecrypt opened this issue 9 years ago • 12 comments

The http and scribe collectors accept requests and return early; a thread later persists to storage or logs an error. We need to do this because we don't want instrumentation to block.

We currently use the same approach in the kafka collector, but we might want to reconsider it.

Right now, a surge in writes to elasticsearch will result in errors, as elasticsearch threads drop messages they can't store. Instead of reading more than we know we can store, why don't we slow down and consume from kafka at the rate we can store?

For example, we could implement backpressure by linking the storage operations to the kafka consumer threads. If storage operations slow down, the kafka queue will build up, something relatively easy to monitor.
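As a rough illustration of that linkage (plain Java, no Kafka dependency; the bounded queue stands in for the topic, the sleep for a slow storage write, and all numbers are made up): because the storage write happens on the consumer thread, consumption can never outpace storage, and any backlog shows up as queue depth.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {

  /** Produces totalSpans into a bounded queue while a slow "storage" thread drains it. */
  static int runSimulation(int totalSpans, int queueCapacity, long storageMillisPerSpan)
      throws InterruptedException {
    // Stand-in for the Kafka topic: a bounded queue of encoded spans.
    BlockingQueue<String> topic = new ArrayBlockingQueue<>(queueCapacity);

    // Producer: instrumentation reporting spans faster than storage can write them.
    Thread producer = new Thread(() -> {
      try {
        for (int i = 0; i < totalSpans; i++) {
          topic.put("span-" + i); // blocks when the queue is full: backpressure
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    // Consumer: the storage write happens on the consumer thread, so consumption
    // slows to the storage rate. Lag shows up as queue depth, which is easy to monitor.
    Thread consumer = new Thread(() -> {
      try {
        for (int i = 0; i < totalSpans; i++) {
          topic.take();
          TimeUnit.MILLISECONDS.sleep(storageMillisPerSpan); // simulated slow storage write
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    producer.start();
    consumer.start();
    producer.join();
    consumer.join();
    return topic.size(); // 0 once everything has been stored
  }

  public static void main(String[] args) throws Exception {
    System.out.println("final queue depth: " + runSimulation(200, 50, 2));
  }
}
```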

Any thoughts on the idea or other alternatives?

codefromthecrypt avatar Jul 12 '16 08:07 codefromthecrypt

cc @eirslett @kristofa @prat0318 @liyichao @anuraaga @basvanbeek @abesto thoughts?

codefromthecrypt avatar Jul 12 '16 08:07 codefromthecrypt

For example, right now, if too many writes go to Elasticsearch, we get logs like this. It seems that with Kafka, we have an opportunity to just read less.

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.transport.TransportService$4@3b438fe4 on EsThreadPoolExecutor[index, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6ed30c21[Running, pool size = 8, active threads = 8, queued tasks = 200, completed tasks = 72]]

codefromthecrypt avatar Jul 12 '16 08:07 codefromthecrypt

Sounds like a good idea to me!

eirslett avatar Jul 12 '16 08:07 eirslett

Just curious: we are thinking of modifying just the kafka transport because it can handle throttling and neither http nor scribe can, correct?

Overall, looks like a good idea.

prat0318 avatar Jul 12 '16 18:07 prat0318

@prat0318 exactly

codefromthecrypt avatar Jul 13 '16 00:07 codefromthecrypt

Hi, any update?

liyichao avatar Aug 05 '16 06:08 liyichao

Here's the current thinking.

So Kafka implies running zookeeper. We have an adaptive sampler which uses zookeeper to store the target rate that the storage layer is capable of. We could reuse this code to rate-limit kafka consumers (and/or drop spans), without adding a new service dependency (because kafka already needs zookeeper).
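A sketch of how a target rate could be enforced on the consumer side (a hand-rolled token bucket in plain Java; in practice the rate would come from the adaptive sampler's zookeeper node rather than the constant used here):

```java
import java.util.concurrent.TimeUnit;

/** Minimal token bucket: permits refill at a fixed rate; acquire() blocks until one is free. */
public class TokenBucket {
  private final double permitsPerSecond;
  private double available;
  private long lastRefillNanos;

  public TokenBucket(double permitsPerSecond) {
    this.permitsPerSecond = permitsPerSecond;
    this.available = permitsPerSecond; // start with one second's worth of burst
    this.lastRefillNanos = System.nanoTime();
  }

  public synchronized void acquire() throws InterruptedException {
    while (true) {
      refill();
      if (available >= 1.0) {
        available -= 1.0;
        return;
      }
      // Sleep roughly until the next permit is due. (Single-threaded demo:
      // a real limiter would avoid sleeping while holding the lock.)
      TimeUnit.MILLISECONDS.sleep((long) Math.ceil(1000.0 / permitsPerSecond));
    }
  }

  private void refill() {
    long now = System.nanoTime();
    available = Math.min(permitsPerSecond, // cap the burst at one second's worth
        available + (now - lastRefillNanos) / 1e9 * permitsPerSecond);
    lastRefillNanos = now;
  }

  public static void main(String[] args) throws InterruptedException {
    TokenBucket limiter = new TokenBucket(100); // e.g. storage sustains 100 spans/sec
    long start = System.nanoTime();
    for (int i = 0; i < 200; i++) {
      limiter.acquire(); // the consumer thread would call this before each storage write
    }
    long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
    // 200 permits at 100/sec with a 100-permit initial burst: roughly 1 second.
    System.out.println("elapsed ms: " + elapsedMillis);
  }
}
```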

There's a problem we might have, which is that kafka marks slow consumers dead (I hear).

thoughts?

codefromthecrypt avatar Aug 07 '16 02:08 codefromthecrypt

Now's a good time to start working on this, since the elasticsearch code is stabilizing. cc @openzipkin/elasticsearch

codefromthecrypt avatar Oct 06 '16 14:10 codefromthecrypt

The easiest, dead-simple start on this is to make storage block the kafka stream loop.

codefromthecrypt avatar Oct 06 '16 23:10 codefromthecrypt

Is there a third-party library that can support Kafka backpressure with Zipkin?

FreeSlaver avatar Mar 14 '17 08:03 FreeSlaver

If you look at zipkin-aws and zipkin-azure, in both cases you can add a custom collector. So yes, you could make a third-party kafka collector that does backpressure nicely. However, it would be even nicer to contribute such a change upstream, as all we are missing is hands to help!


codefromthecrypt avatar Mar 15 '17 06:03 codefromthecrypt

Kafka will only hand you the messages as fast as you can receive them, so indeed, if backpressure is ever needed we could make a blocking version of the collector.

There's a problem we might have, which is that kafka marks slow consumers dead (I hear).

Indeed, this refers to https://kafka.apache.org/documentation/#max.poll.interval.ms. The default value is 5 minutes though; if the storage write from the latest message poll did not complete within 5 minutes, you have other issues, mkay.
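For reference, that consumer property at its default (shown here in properties-file form):

```properties
# Maximum time between poll() calls before the consumer is considered failed
# and the group rebalances its partitions away (default: 300000 ms = 5 minutes).
max.poll.interval.ms=300000
```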

jorgheymans avatar Aug 01 '20 20:08 jorgheymans