jaeger-ui icon indicating copy to clipboard operation
jaeger-ui copied to clipboard

WIP: Distinguish traces by service namespace and instance.id

Open nikclayton-dfinity opened this issue 3 years ago • 17 comments

When displaying / distinguishing traces by name or colour consider the service.namespace and service.instance.id tags if present.

This helps distinguishes traces that span multiple processes that make up the same logical service where different instances have different IDs.

Which problem is this PR solving?

https://github.com/jaegertracing/jaeger-ui/issues/721

Short description of the changes

This is not yet ready to merge, I've opened the PR because I think there are a few ways to proceed and I wanted to get consensus on whatever approach is acceptable to the Jaeger maintainers. I'll add comments to this PR with additional information.

nikclayton-dfinity avatar Apr 02 '21 09:04 nikclayton-dfinity

First, here's what Jaeger does with my service without this PR. In this example the trace spans two different processes that are part of the same service, and they are gossiping information between them.

In this set of screenshots (without this PR) Jaeger only looks at the service name, and doesn't distinguish between the different instances involved.

Screenshot 2021-04-02 at 11 24 30 Screenshot 2021-04-02 at 11 24 50 Screenshot 2021-04-02 at 11 25 02

nikclayton-dfinity avatar Apr 02 '21 09:04 nikclayton-dfinity

This is the same thing, but with this PR.

Screenshot 2021-04-02 at 11 25 15

The first result shows two entries for the same service, and includes the namespace (empty in this example) and the instance ID, separated by :.

Screenshot 2021-04-02 at 11 25 20

The span graph and timeline viewer distinguish spans from the two different instances with different colours.

Screenshot 2021-04-02 at 11 25 28

The trace graph distinguishes spans from the two different instances with different colours.

nikclayton-dfinity avatar Apr 02 '21 09:04 nikclayton-dfinity

The service is replica, and each instance of that service is reporting a different service.instance.id tag value (they don't have a service.namespace).

These fields are from the OpenTelemetry resource semantic conventions at https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/resource/semantic_conventions#service.

nikclayton-dfinity avatar Apr 02 '21 09:04 nikclayton-dfinity

The code's the simplest thing to prove the concept, but isn't production ready -- some code is duplicated, some information is passed lower down the call tree than I'd like, and it can probably be made more general.

I think the best solution is probably to:

  1. Introduce a new class, Service (maybe in types/traces.tsx) that can be initialised from a Span (Service is a very generic name, perhaps SpanService or something else would be better?)
  2. Add methods to this type to return the name, namespace, and instance.id tags
  3. Add a display method to this type that returns ${name}:${namespace}:${instance_id}
  4. Add color_hex and color_rgb methods to this type that returns the correct color to use
  5. Pass an instance of this new object everywhere the serviceName string is currently expected.

Another way to achieve a similar thing would be to do the above, but instead of a Service class, convert the existing Process type in to a class with those methods -- it already holds the serviceName field so this would be a natural extension of that. And then pass that down.

Or do that, but on the existing Span type (convert it to a class, add methods, etc).

Or something else? I'm not very familiar with the existing design decisions in the code, so I'm happy to take whatever approach the maintainers think will best fit.

Optional other cleanup work, either in this PR, or a followup:

I had to spend a bit of time dealing with the different representations of color -- sometimes it's a hex string, sometimes it's an RGB triple. A Color type with as_hex and as_rgb methods might be useful.

The type CanvasSpanGraphProps has an items field which duplicates the type definition of SpanItem. It wasn't clear why it doesn't use SpanItem directly.

nikclayton-dfinity avatar Apr 02 '21 09:04 nikclayton-dfinity

cc @albertteoh

yurishkuro avatar Apr 02 '21 14:04 yurishkuro

You seem to be making color decisions by default, based on a specific tag that may not be applicable to every user.

Sort of, but in exactly the same way that Jaeger currently does it.

Conceptually, Jaeger takes all the spans from a trace, and puts each span in to a group. The group gets a colour that Jaeger determines.

Before this PR, Jaeger puts spans in to groups based on the span's service name.

At it's heart, all this PR does is change the logic so that instead of ${serviceName} being the string that's used to determine the group, ${serviceName}:${serviceNamespace}:${serviceInstanceId} is the string that's used to determine the group.

So if spans don't have a namespace and instance ID then the behaviour is unchanged (you can see that in the first set of screenshots). Everything goes in to one group.

If the spans do have a namespace and/or instance ID then you get more groups.

But the mechanism that Jaeger uses to assign a colour to a group is not changed by this PR.

nikclayton-dfinity avatar Apr 02 '21 19:04 nikclayton-dfinity

Sort of, but in exactly the same way that Jaeger currently does it.

I understand that, but the main difference is that the service.name is a fundamental property in Jaeger data model, you can't even use Jaeger SDKs without providing a service name, so treating is specially in the UI it justified. The properties you're considering are not standard, nor produced by many instrumentations. On the other hand, the actual coloring mechanism is generic, not dependent on the particular span attributes, so hardcoding the exact list of its attributes for color coding doesn't sit well with me. At minimum it should be site-configurable, & ideally it should be use-controlled (e.g. what if I do have those attributes but I don't want to see different colors for the same service?)

yurishkuro avatar Apr 02 '21 20:04 yurishkuro

The properties you're considering are not standard, nor produced by many instrumentations.

That's true -- they're an experimental recommendation from a sibling CNCF project.

I'm coming from the perspective that I do think it's reasonable to expect (and as a user trying to use these projects in production) that different CNCF projects should, where practical, out-of-the-box, integrate cohesively with one another.

So it would be reasonable for Jaeger to -- with no additional user configuration required -- recognise and respond to OpenTelemetry semantic conventions on spans.

Whereas if there were other conventions recommended by non-CNCF projects, it would be reasonable for the user to have to explicitly configure Jaeger to support those conventions.

Again, with the caveat that these OpenTelemetry semantic conventions are currently marked as experimental.

On the other hand, the actual coloring mechanism is generic, not dependent on the particular span attributes, so hardcoding the exact list of its attributes for color coding doesn't sit well with me.

I don't think this is true. As you note earlier, Jaeger already hardcodes the exact list of attributes used for color coding (specifically, it only uses one, the service name). This change (in whatever its final form looks like) just extends that list.

At minimum it should be site-configurable, & ideally it should be use-controlled (e.g. what if I do have those attributes but I don't want to see different colors for the same service?)

and @albertteoh wrote:

Moreover, a configurable span colouring scheme could even open up interesting use cases that allow users to determine their own colouring scheme based on other span attributes (e.g. by deployment region instead of instance), that may even be completely independent of service names.

There are a few open issues that touch on this:

  • https://github.com/jaegertracing/jaeger-ui/issues/582
  • https://github.com/jaegertracing/jaeger-ui/issues/121

which have been open for some time (almost 3.5 years in one case) with no code that implements either of them being written.

It's also not something I've got the time to tackle at the moment.

OTOH, my PR, when cleaned up, would:

  • Fix this particular use case
  • Put the code for determining the grouping in a single place, making subsequent work for #582 or #121 more tractable (the code is probably not the difficult bit there, deciding how the user will configure it is)

nikclayton-dfinity avatar Apr 03 '21 15:04 nikclayton-dfinity

It's a really interesting idea to have more expressive heuristics around span coloring. I've always thought using service name or anything with similar cardinality is a losing battle.

Great point that this hasn't changed since pre v1.0 (IIRC).

OTOH, we now have the ability to specify the UI configuration via JaveScript, which makes it a ton easier to empower interesting configurations.

I agree with Yuri's points, and making it configurable and unchanged unless configuration indicates otherwise seems like a good idea, keeping the current behavior and opting in for the enhancement, for now at least, seems practical.

It's be nice to have it user controlled, as in per user, but that's a lot more work than say making it an element of the ui configuration, which can now be a JS file, and is therefore low hanging fruit for this type of thing. 🥝

A simple field on JS configurations like the following might work:

getSpanViewModifiers(...): { color?: number | string, colorId?: string } | void

The thinking for the fields is: color is either a number which is the actual color or a valid color according to CSS styling spec, eg "#c00" etc.

The color ID would just be any value and would have a color associated with the ID in a consistent manner, based on internal determinations of what to color a given ID, ie akin to a colorHash(...) so to speak.

And, the function would generally return either color or a color ID, not both.

Alternatively, and more conservatively, the config could specify a pattern that would determine the color ID for the span. E.g.

#{span service}|#{span.operation}

Or

#{span.tags.component}
#{soan.process.tags.instanceId}

The function approach has the benefit of being able to do things like color based on error or potentially even duration. It has more risk re expanding api contract, though, and we would probably want to pass a simple view onto a span not the span itself if we were to go this route.

Another thing, with these approaches we could create presets, so to speak, so the JS config could have minimal work required to fit your use case of the OTEL semantics, given the version, eg

{
  // etc...
  viewModifiers: { span: ['preset:otel-v0.1.2', 'process'] }
}

Anyway, a few alternative ideas — what do you think?

tiffon avatar Apr 03 '21 16:04 tiffon

Now that I've re-read this, I can see the PR is also proposing changes to various aggregations, like the service tags in the search results. That makes sense in that how can you have one aggregated data point that has more than one visual representation. OTOH, it makes the stakes a lot higher.

I still recommend leveraging JS config for custom stuff and also proving presets for common scenarios like OTEL, istio,etc. But, if this is to change aggregations in search results it should change all of them, consistently, and therefore the trace DAG, diffs, and DDG, too.

That's a large scope. Is there a simplification that would get it to 80% of your use case or maybe more without being so sweeping in scope?

tiffon avatar Apr 03 '21 16:04 tiffon

@tiffon

OTOH, we now have the ability to specify the UI configuration via JaveScript,

Did you actually mean JavaScript there, or just JSON?

That's a large scope. Is there a simplification that would get it to 80% of your use case or maybe more without being so sweeping in scope?

Probably not. In many ways the colour change is a bit of a red herring (no pun intended). The actual change is how Jaeger groups spans within a trace. The colouring changes naturally fall out of that.

So, here are the two options I think:

1: Allow the user to specify the span grouping through the config

{
  // etc...
  spanGroupKey: { fields: [...] }
}

?

The current behaviour would be:

{
  // etc...
  spanGroupKey: { fields: ['service.name'] }
}

The otel behaviour would be:

{
  // etc...
  spanGroupKey: { fields: ['service.name', 'process.service.namespace', 'process.service.instance.id'] }
}

Pro:

  • Most general approach
  • Resolves two other open issues

Con:

  • More code, unrelated to the specific problem I'm trying to solve
  • Doesn't do the right thing out of the box.

2: Don't make this user configurable yet, but do make grouping decisions based on other properties of the span.

For example, an otel span must have an otel.library.name field. So if that field exists then Jaeger could automatically group by the name/namespace/id triple.

Pro:

  • Simpler to implement
  • Solves my specific problem
  • Does the right thing for OpenTelemetry spans with no additional user configuration

Con:

  • The two other open issues stay open

nikclayton-dfinity avatar Apr 06 '21 09:04 nikclayton-dfinity

Requested re-review of the most recent comments on the options for proceeding.

nikclayton-dfinity avatar Apr 09 '21 13:04 nikclayton-dfinity

I would vote for a version of option 1. There are two main changes required by your use case:

  1. the ability to provide a lambda for determining the color of a node instead of hardcoding it to service.name. This sounds like the main chunk of work. I am not sure if it's possible to have a single API for such lambda, i.e. what would be the argument to it, the Span object, or something else?
  2. the ability to configure the lambda in JSON or JS configuration.

I don't agree with your assessment that Option 1 "Doesn't do the right thing out of the box" - it does, because the "right thing" (while subjective) is to be backwards compatible with the existing UI behavior by default, not to change it. People need to opt-in into the new behavior.

Regarding configuration, your proposal spanGroupKey: { fields: ['service.name'] } LGTM, however, as @tiffon mentioned, I would start simpler and provide just a few pre-defined configurations like service-name and otel-service-instance instead of building a fully dynamic config that must parse field names from the array.

yurishkuro avatar Apr 10 '21 18:04 yurishkuro

PTAL.

No documentation or test updates yet, I'll tackle those when there's broad approval for this approach.

To try it out, edit default-config.tsx and add a spanGroupKey property, either:

spanGroupKey: { preset: 'otel-ef4612d' },

or

spanGroupKey: { tags: ['serviceName', 'service.namespace', 'service.instance.id'] },

(those two are functionally the same thing) and the UI will group and colour spans according to those fields, in that order. Obviously, you'll need some traces with spans that set those fields. If you don't then adjust the value for tags as necessary.

BTW, using yarn start is there a straightforward way to pass in a path to a config file to load? I couldn't see any after a cursory check, and couldn't find anything mentioned in the documentation.

Some things I'm still not overly happy about, and would appreciate suggestions for:

  • The property name spanGroupKey. It's functional, but maybe there's a better name.

  • I add a group property to the Span type, which is a very generic name. Maybe there's something less generic that would be a better fit?

nikclayton-dfinity avatar Apr 13 '21 22:04 nikclayton-dfinity

Ping @yurishkuro @tiffon @albertteoh

nikclayton-dfinity avatar Apr 16 '21 16:04 nikclayton-dfinity

@albertteoh do you have someone who can review this?

I've reached out to my team to see if someone can help take a look.

albertteoh avatar May 03 '21 01:05 albertteoh

I've reached out to my team to see if someone can help take a look.

@Danafrid, @th3M1ke or @yoave23 please take a look when you get a chance.

albertteoh avatar May 22 '21 05:05 albertteoh