
Monitor message flow in Kafka

gfonseca-tc opened this issue 3 years ago • 5 comments

Is your feature request related to a problem? Please describe.
Monitoring event-driven environments can be tricky. Besides the system aspects, which can be monitored using existing tools, there are also business aspects that need more logic to monitor. How can we follow the track of a message that goes through lots of applications and topics? How can we alert on a missing part of an event-driven trace? I would like an "observer" capable of emitting a metric saying whether a certain message is on the right track, whether it has passed through the right topics, and so on. I'm not sure the collector is the right place for that, but the way I imagine it, it would be something like the prometheus receiver.

Describe the solution you'd like
A receiver in which I can configure a sequence of topics that a message has to go through within a time frame. I would configure the time window, the list of topics, the message ID, and where to find it in the message. The receiver would listen to those topics and check whether messages follow the expected path within the configured time window. It would emit a metric indicating whether a message is doing fine, has missed a topic, or is delayed in reaching some point.
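To make the idea a bit more concrete, here is a rough sketch of what the configuration for such a receiver could look like, following the collector's usual pattern of a Go config struct with mapstructure tags. The component, package name, and every field here are hypothetical; nothing like this exists in the collector today.

```go
// Hypothetical configuration for the proposed "message flow" receiver.
// Everything below is illustrative only and does not exist in the collector.
package messageflowreceiver

import "time"

// MessageIDConfig describes where the correlation ID lives in each message.
type MessageIDConfig struct {
	// Header is the Kafka header (or payload field) carrying the message ID.
	Header string `mapstructure:"header"`
}

// Config holds the settings described above: brokers, the expected topic
// sequence, the time window, and how to extract the message ID.
type Config struct {
	Brokers []string `mapstructure:"brokers"`
	// Topics is the ordered sequence of topics a message is expected to traverse.
	Topics []string `mapstructure:"topics"`
	// Window is how long a message may take to complete the whole sequence.
	Window time.Duration `mapstructure:"window"`
	// MessageID tells the receiver where to find the ID used for correlation.
	MessageID MessageIDConfig `mapstructure:"message_id"`
}
```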

Describe alternatives you've considered
We've considered listening to dead-letter topics and checking what ends up there, but that might not alert us soon enough and might not make clear where in the path the failure happened. We've also considered comparing incoming vs. outgoing messages, but that doesn't tell us which message failed or where.

gfonseca-tc avatar Feb 07 '22 11:02 gfonseca-tc

@fangyi-zhou, I think you might be able to help here :-)

jpkrohling avatar Feb 07 '22 12:02 jpkrohling

Hi :wave:

I'm a PhD student at Imperial College London; we're investigating how to use OpenTelemetry and Multiparty Session Types for monitoring protocol conformance. Currently there's only a primitive prototype, but you can have a look at some slides or a (possibly outdated) writeup.

Please don't hesitate to get in touch if you're interested!

/cc @fferreira

fangyi-zhou avatar Feb 07 '22 13:02 fangyi-zhou

Wow! Pretty cool what you are doing, @fangyi-zhou! I'll take a closer look, but it seems to cover more complex cases than what I imagined. I was thinking of a pretty straightforward approach: something like setting pairs of topics to listen to and a common key to verify. The "observer" would then listen for the messages and keep track of the IDs to check whether a message that passed through A reached B in the given time frame. I will read through everything you sent me and come back to see how I can help! Thanks for the prompt response!
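For what it's worth, a minimal sketch of that pairwise tracking logic might look like the following. All names are made up and there is no Kafka wiring here; it only shows the ID and time-window bookkeeping such an "observer" would need.

```go
// Sketch of pairwise topic tracking: remember when an ID was seen on the
// source topic and flag it if it doesn't reach the destination topic within
// the window. Purely illustrative; not an existing collector component.
package main

import (
	"fmt"
	"sync"
	"time"
)

type pairTracker struct {
	mu      sync.Mutex
	window  time.Duration
	pending map[string]time.Time // message ID -> time first seen on topic A
}

func newPairTracker(window time.Duration) *pairTracker {
	return &pairTracker{window: window, pending: make(map[string]time.Time)}
}

// SeenOnSource records that a message ID passed through topic A.
func (t *pairTracker) SeenOnSource(id string, at time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[id] = at
}

// SeenOnDestination marks the ID as complete and returns the observed latency.
func (t *pairTracker) SeenOnDestination(id string, at time.Time) (time.Duration, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	start, ok := t.pending[id]
	if !ok {
		return 0, false
	}
	delete(t.pending, id)
	return at.Sub(start), true
}

// Sweep returns IDs that exceeded the window without reaching topic B; a real
// receiver would emit these as a metric rather than returning them.
func (t *pairTracker) Sweep(now time.Time) []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var missing []string
	for id, seen := range t.pending {
		if now.Sub(seen) > t.window {
			missing = append(missing, id)
			delete(t.pending, id)
		}
	}
	return missing
}

func main() {
	tr := newPairTracker(30 * time.Second)
	tr.SeenOnSource("order-123", time.Now().Add(-time.Minute))
	fmt.Println("missing:", tr.Sweep(time.Now())) // prints: missing: [order-123]
}
```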

gfonseca-tc avatar Feb 07 '22 14:02 gfonseca-tc

@gfonseca-tc ,

This is a really interesting topic and something my team has been thinking about. I imagine our use cases are more or less aligned here, so I'm happy to share our thoughts on it.

I imagine much of what you want is to be able to confidently set SLOs/SLIs on your ability to process incoming telemetry data and to track the error rate of undelivered messages?

How can we follow the track of a message that goes through lots of applications and topics?

I really think you should start by defining what your SLOs/SLIs are. You could measure from the client side (looking at the finished/stopped time) by adding an additional processor (external to otel at this point) that reads from the same topic, calculates the approximate time the client took to send (the telemetry end time), and compares that with the commit time (the time the record was written to Kafka; I believe this is stored by Kafka itself rather than within the message, so I'd need to check).

That is a rather "simple" way to calculate end-to-end latency, provided the stated assumptions hold, and you can emit that data back into a collector that sends it to a meta-monitoring environment. (Though I highly recommend keeping the meta environment, the environment storing all the data about your data, as simple as possible: feature creep makes it nightmarish to maintain and can drive up cost as well.)
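As a rough illustration of that external-processor idea, the sketch below uses the segmentio/kafka-go client to read from the same topic the exporter writes to and compares the record timestamp against a hypothetical header carrying the client's telemetry end time. The broker address, topic name, and header key are all assumptions, and whether msg.Time reflects producer time or broker append time depends on the topic's timestamp configuration.

```go
// Minimal sketch of an external latency checker: consume the exporter's topic,
// read a (hypothetical) header with the client's telemetry end time, and
// compare it with the record timestamp to approximate end-to-end latency.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // assumed broker address
		Topic:   "otlp_spans",               // assumed topic name
		GroupID: "latency-check",
	})
	defer r.Close()

	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			fmt.Println("read error:", err)
			return
		}
		// "telemetry_end_time" is a made-up header: the producer would have to
		// stamp it with the telemetry end time before sending.
		for _, h := range msg.Headers {
			if h.Key != "telemetry_end_time" {
				continue
			}
			endTime, err := time.Parse(time.RFC3339Nano, string(h.Value))
			if err != nil {
				continue
			}
			// Approximate end-to-end latency: record timestamp minus client end time.
			fmt.Printf("offset %d latency %s\n", msg.Offset, msg.Time.Sub(endTime))
		}
	}
}
```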

How can we alert on a missing part of an event driven trace?

This would need to be done as a post-analysis, since consolidating traces requires the client and the server/sender/receiver to finish their transaction and forward it to a central processing point where the data can be consolidated.

(@jpkrohling has way deeper knowledge in this area, so I'm completely happy to be corrected here, or to be shown an amazing "online" process that doesn't require centralised processing to validate the complete trace event.)

Though, if you simply define a window of "acceptable time" within which all trace data must be in the system, you can do some meta-analysis (check which service is failing to send its trace/span data, check how much of the trace/span is missing, or whatever else matters to you and is possible with the data).

I really like your idea here:

A receiver that I can configure a sequence of topics that a message has to go through during a time frame

but I would want to generalise and extend it to the entire pipeline, so you can calculate end-to-end latency within the collector itself.

Also, by no means am I intending to discredit or gloss over @fangyi-zhou's work; I think it is really awesome to see such a validation system being built on top of OpenTelemetry (and by all means, if I can help, please feel free to reach out on the CNCF Slack space). These were just my thoughts and my team's approach to trying to solve this "meta" observability problem.

MovieStoreGuy avatar Feb 16 '22 06:02 MovieStoreGuy

to be shown an amazing "online" process that doesn't require centralised processing to validate the complete trace event

I too cannot think of a way to do that without a centralized system. The problem is that each individual event is sent by the originating service, which might be batched with different parameters than other, similar services. So the "operation-1" span might reach storage after its child, "operation-2", even if they are synchronous operations.

Though, simply defining a window of an "acceptable time" of when all trace data must be within the system, and you can do some meta analysis

That's basically what the "group by trace" processor does: you tell it to hold data in memory for a specific amount of time, and the next consumer in the pipeline receives a trace with all the collected spans: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/groupbytraceprocessor. You can then create your own processor that runs in the pipeline after the "group by trace" one.
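As a sketch of the kind of check such a custom processor could run after "group by trace", the function below walks a grouped ptrace.Traces and reports which expected services never contributed a span. The pdata calls come from go.opentelemetry.io/collector/pdata, but the processor wiring is omitted and the service names are made up.

```go
// Illustrative completeness check for a grouped trace: which expected
// services are missing from the collected spans? Not a full processor,
// just the core comparison a post-groupbytrace step might perform.
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// missingServices returns the expected services that have no spans in td.
func missingServices(td ptrace.Traces, expected []string) []string {
	seen := make(map[string]bool)
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		if v, ok := rss.At(i).Resource().Attributes().Get("service.name"); ok {
			seen[v.AsString()] = true
		}
	}
	var missing []string
	for _, svc := range expected {
		if !seen[svc] {
			missing = append(missing, svc)
		}
	}
	return missing
}

func main() {
	td := ptrace.NewTraces()
	rs := td.ResourceSpans().AppendEmpty()
	rs.Resource().Attributes().PutStr("service.name", "checkout")

	// "checkout" and "payments" are illustrative service names.
	fmt.Println(missingServices(td, []string{"checkout", "payments"})) // [payments]
}
```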

That said, the typical way of doing this is indeed by doing some sort of post-analysis job on the tracing data.

jpkrohling avatar Feb 17 '22 13:02 jpkrohling

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot] avatar Nov 16 '22 03:11 github-actions[bot]

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot] avatar Jun 26 '23 03:06 github-actions[bot]

Closing, if this is still relevant, feel free to reopen.

jpkrohling avatar Jul 05 '23 13:07 jpkrohling