
Visualize large swaths of traces

Open codefromthecrypt opened this issue 5 years ago • 10 comments

From https://twitter.com/soini/status/1057642318365306880

For zipkin, if I had one wish, it'd be native visualization options for large swaths of traces.

Rationale: Currently, our only aggregate tool is the service dependency diagram. While we've discussed doing single trace aggregate (for a stable shape regardless of call count to redis, for example), we've not discussed other ways of visualizing large amounts of traces.

Example Scenario: Hoping to hear about this from Jon so that we make the right thing!

Prior Art: Jon is building a multi-level Sankey diagram w/ Vega in Kibana in the interim.

Please note other prior art.

codefromthecrypt avatar Oct 31 '18 23:10 codefromthecrypt

also related https://engineering.salesforce.com/anomaly-detection-in-zipkin-trace-data-87c8a2ded8a1

codefromthecrypt avatar Nov 01 '18 01:11 codefromthecrypt

Because we currently need to understand whether this is visualizing for the purpose of detecting anomalies, traffic flow, or the shape of call graphs, we'll want to see more input. At Netflix in particular, we discussed that people may want something like a service graph with more facts (such as cluster and zone), but we don't know exactly what's in mind without hearing more from Jon or others.

codefromthecrypt avatar Nov 01 '18 01:11 codefromthecrypt

From Jon on Twitter: https://twitter.com/soini/status/1058078887488315392

here's one of the viz's that's been kicking around in my head put (quickly) to paper. Being able to aggregate the node<>node flows, and then break them down by terms (error codes) or duration ranges would be cool, I think.

(image: sketch of the visualization attached to the tweet)
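One way to read the idea above is as a count per (source, target, term) triple, where the term is an error code when there is one and a duration range otherwise. The following is only a minimal sketch under that reading: the input shape, the `BUCKETS_MS` boundaries and the function names are hypothetical, not anything Zipkin provides.

```python
from collections import Counter

# Hypothetical duration range upper bounds in milliseconds; not from Zipkin.
BUCKETS_MS = [10, 100, 1000]


def duration_bucket(duration_ms):
    """Return a label such as '<=10ms' or '>1000ms' for a duration."""
    for upper in BUCKETS_MS:
        if duration_ms <= upper:
            return f"<={upper}ms"
    return f">{BUCKETS_MS[-1]}ms"


def aggregate_flows(calls):
    """Count calls per (source, target, term).

    The term is the error code if the call has one, else its duration range.
    """
    flows = Counter()
    for call in calls:
        term = call.get("error") or duration_bucket(call["duration_ms"])
        flows[(call["source"], call["target"], term)] += 1
    return flows


# Made-up calls between services, for illustration only.
calls = [
    {"source": "frontend", "target": "backend", "duration_ms": 8},
    {"source": "frontend", "target": "backend", "duration_ms": 120},
    {"source": "frontend", "target": "backend", "duration_ms": 95, "error": "500"},
    {"source": "backend", "target": "redis", "duration_ms": 2},
    {"source": "backend", "target": "redis", "duration_ms": 3},
]

flows = aggregate_flows(calls)
for (source, target, term), count in sorted(flows.items()):
    print(f"{source} -> {target} [{term}]: {count}")
```

Each resulting triple could become one band between two nodes in a Sankey-style diagram, with the count as the band width.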

codefromthecrypt avatar Nov 03 '18 08:11 codefromthecrypt

A chat with @narayuna offline clarified for me how to phrase this effort.

So the notes here aren't to say Sankey is the way, as there are lots of details that need to be elaborated: not only why something should be built, but also who it is for and what affects the feature (related features, data, etc.). It isn't a goal to implement an image; our goal is to serve users.

Jon is a user who asked about this, and he has some experience to share, so we are welcome to take it. This is awesome, as not everyone in Zipkin is experienced in the same ways. In OSS, some are happy to help with the work needed to implement things they don't fully understand. We love experience dumps for this reason. That's also why we hold workshops: https://cwiki.apache.org/confluence/display/ZIPKIN/Workshops

To be realistic, focused elaboration of something like this is the expensive part, and this will involve a lot of folks. We may find we don't have the right data for some things and decide either to collect it or even not do that! There are a lot of unknown dependencies at this phase of thinking for us.

So, setting expectations: this might be a collection of thoughts until the right time to dig deeper. The "right time" is likely after we've redone the UI, as we are swamped right now. So, tactically, we work on getting the current UI in shape. Strategically, we raise issues like this and hold workshops to collect a general sense of direction and people.

Ideal outcome is that when hands are free of the tactical, we end up choosing, integrating or building the right thing for end users' benefit as told to us by them!

codefromthecrypt avatar Nov 03 '18 08:11 codefromthecrypt

https://github.com/openzipkin/zipkin/pull/2731 is a first step, as it ports the basic dependency linker used in Spark jobs to JavaScript
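For readers unfamiliar with what a dependency linker does, here is a heavily simplified sketch, not the code from the pull request. It assumes spans shaped like Zipkin v2 JSON (`id`, `parentId`, `localEndpoint.serviceName`, `tags`) and assumes every span ID is unique within the trace; the real linker also has to deal with span kinds, shared client/server spans, remote endpoints and missing parents, none of which is handled here. The function name `link_trace` is made up for illustration.

```python
from collections import defaultdict


def link_trace(spans, links=None):
    """Add parent-service -> child-service links for one trace.

    Simplifying assumption: a link exists wherever a span's parent
    was recorded by a different service.
    """
    if links is None:
        links = defaultdict(lambda: {"callCount": 0, "errorCount": 0})
    by_id = {span["id"]: span for span in spans}
    for span in spans:
        parent = by_id.get(span.get("parentId"))
        if parent is None:
            continue  # root span, or parent not present in this trace
        parent_service = parent["localEndpoint"]["serviceName"]
        child_service = span["localEndpoint"]["serviceName"]
        if parent_service == child_service:
            continue  # work inside one service, not a cross-service call
        link = links[(parent_service, child_service)]
        link["callCount"] += 1
        if "error" in span.get("tags", {}):
            link["errorCount"] += 1
    return links


# A made-up trace: frontend calls backend, which calls redis twice.
trace = [
    {"id": "1", "localEndpoint": {"serviceName": "frontend"}},
    {"id": "2", "parentId": "1", "localEndpoint": {"serviceName": "backend"}},
    {"id": "3", "parentId": "2", "localEndpoint": {"serviceName": "redis"}},
    {"id": "4", "parentId": "2", "localEndpoint": {"serviceName": "redis"},
     "tags": {"error": "timeout"}},
]

links = link_trace(trace)
for (parent, child), counts in sorted(links.items()):
    print(parent, "->", child, counts)
```

Passing the same `links` object into repeated calls accumulates counts across many traces, which is the kind of aggregate this issue is asking to visualize.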

codefromthecrypt avatar Aug 03 '19 20:08 codefromthecrypt

It would be useful to be able to visualise percentile-based durations in some way. E.g. 99% of spans for a particular edge in the dependency graph have a duration of 123ms or less (p99 = 123ms)
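As a rough illustration of what such a number means, the sketch below groups call durations by edge and takes a nearest-rank percentile. The `(parent, child, duration_ms)` input shape and both function names are assumptions made for the example; Zipkin's dependency links do not carry durations as far as this thread shows.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of values <= it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def edge_percentiles(calls, p=99):
    """Map each (parent, child) edge to the p-th percentile of its durations."""
    durations_by_edge = defaultdict(list)
    for parent, child, duration_ms in calls:
        durations_by_edge[(parent, child)].append(duration_ms)
    return {edge: percentile(d, p) for edge, d in durations_by_edge.items()}


# Made-up durations: 1..200 ms on one edge, five calls on another.
calls = [("frontend", "backend", d) for d in range(1, 201)]
calls += [("backend", "redis", d) for d in (1, 2, 2, 3, 40)]

p99_by_edge = edge_percentiles(calls, p=99)
for edge, value in sorted(p99_by_edge.items()):
    print(edge, f"p99 <= {value}ms")
```

Note how the edge with only five calls reports its single slow call as the p99, so a percentile on a thinly sampled edge should be read with care.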

msmsimondean avatar Oct 16 '19 16:10 msmsimondean

@jeqo I think only https://github.com/jeqo/zipkin-storage-kafka could produce the data for this, as it implies 100% collection and processing of it. Unless it is a small amount of traces which could be done client-side. That or a different system like haystack

codefromthecrypt avatar Oct 16 '19 23:10 codefromthecrypt

There may be some on-demand but cached aggregation features in ES at the moment; no idea.

codefromthecrypt avatar Oct 16 '19 23:10 codefromthecrypt

@adriancole not sure how I missed this issue :(

100% collection + late-sampling would be a good way to position the Kafka-based storage.

Now that Lens is in place, I hope this issue gets bumped again

jeqo avatar Mar 12 '20 23:03 jeqo

@jeqo I think only https://github.com/jeqo/zipkin-storage-kafka could produce the data for this, as it implies 100% collection and processing of it. Unless it is a small amount of traces which could be done client-side. That or a different system like haystack

@codefromthecrypt @jeqo I wonder if you could do this without 100% collection. 100% collection would be nice for a few reasons, but I wonder whether it would be essential. I guess what I'm thinking is that sampled span duration percentiles would be ok with suitably large sample sizes. I've now got some custom Java code that implements span duration response times on top of Zipkin. It's very early days with the code I've got; I've not really drilled into the output much yet, but what I'm seeing so far looks ok.
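The intuition can be checked on synthetic data. The sketch below is not the Java code mentioned above: it draws made-up long-tailed durations, keeps about 10% of them at random, and compares the p99 of the sample with the p99 of the full set. It assumes sampling is unbiased with respect to duration, and it samples individual values, whereas Zipkin samples whole traces.

```python
import math
import random


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of values <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic durations in ms with a long tail; not real Zipkin data.
population = [100 * rng.lognormvariate(0, 0.5) for _ in range(100_000)]

# Keep roughly 10% of the values, mimicking a 0.1 sample rate.
sample = [d for d in population if rng.random() < 0.1]

full_p99 = percentile(population, 99)
sampled_p99 = percentile(sample, 99)
relative_error = abs(sampled_p99 - full_p99) / full_p99

print(f"kept {len(sample)} of {len(population)} durations")
print(f"full p99 = {full_p99:.1f}ms, sampled p99 = {sampled_p99:.1f}ms")
print(f"relative error = {relative_error:.2%}")
```

The error grows as the sample shrinks or the percentile moves further into the tail, so low-traffic edges are where 100% collection would matter most.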

msmsimondean avatar Mar 12 '21 16:03 msmsimondean