Store upstream paths in transactions/spans for service maps
We currently walk traces (via a scripted metric aggregation) to get paths/connections between services. However, that's untenable for a couple of reasons:
- We need to select traces to inspect first, and then walk the traces. This can be slow in many cases, and it's unpredictable.
- A scripted metric aggregation is a foot-gun, and it might be removed from the default distribution in the future, meaning we can no longer rely on it in the APM app.
- It depends on the presence of spans, meaning we can't just purely use (transaction, span) metrics to power the UI.
One solution is to store (hashed) paths in transaction or span metrics, per @axw's suggestion.
Here's how that could possibly work:
- Each service propagates a hash that uniquely identifies the service + the upstream path. Meaning, the root service A propagates a hash of just service A, service B propagates a hash of service A + service B, and so forth.
- This hash is propagated via the tracestate header.
- These hashes are also stored on transactions/spans (or the derived metrics); a rough sketch of the scheme follows this list.
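A minimal sketch of that scheme (the hash function, the truncation, and the tracestate key/format used here are illustrative assumptions, not a spec):

import { createHash } from 'crypto';

// Each hop folds its own service name into the hash it received, so the
// resulting value identifies the service plus its upstream path.
function pathHash(upstreamHash: string | undefined, serviceName: string): string {
  return createHash('sha256')
    .update(upstreamHash ? `${upstreamHash}:${serviceName}` : serviceName)
    .digest('hex')
    .slice(0, 16); // truncated to keep field size/cardinality manageable (assumption)
}

// Root service "a", then "b" downstream of it:
const hashA = pathHash(undefined, 'a'); // what "a" propagates, e.g. tracestate: es=path:<hashA>
const hashAB = pathHash(hashA, 'b');    // what "b" propagates on its own downstream calls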
We should consider the following use cases when deciding where and how to store the hashed paths:
- Global service maps
- Filtered service maps (e.g., by service name or environment)
- Dependency statistics (i.e., metrics for one service directly talking to another service/external dependency)
One requirement is that we should be able to resolve all connections with one or two requests, without using a scripted metric aggregation.
Right now the assumption is that we should store these paths on both spans and transactions:
- On spans, store the hash as it is propagated (so including its own hash). This allows us to map a point in a path to a service name, via its hash.
- On transactions, store the hash that was received via the tracestate header. This allows us to build paths.
Let's suppose we have the following service map:
(service map image: a -> b and a -> c; b -> proxy -> d; c -> d; b and d -> postgres)
We can describe it with the following events:
[
{ "processor.event": "transaction", "service.name": "a" },
{ "processor.event": "span", "service.name": "a", "span.destination.service.resource": "service-b:3000", "span.destination.hash": "hashed-service-a", "event.outcome": "success" },
{ "processor.event": "transaction", "service.name": "b", "transaction.upstream.hash": "hashed-service-a" },
{ "processor.event": "span", "service.name": "a", "span.destination.service.resource": "service-c:3001", "span.destination.hash": "hashed-service-a", "event.outcome": "success" },
{ "processor.event": "transaction", "service.name": "c", "transaction.upstream.hash": "hashed-service-a" },
{ "processor.event": "span", "service.name": "b", "span.destination.service.resource": "proxy:3002", "span.destination.hash": "hashed-service-a-b", "event.outcome": "success" },
{ "processor.event": "transaction", "service.name": "d", "transaction.upstream.hash": "hashed-service-a-b" },
{ "processor.event": "span", "service.name": "c", "span.destination.service.resource": "service-d:3003", "span.destination.hash": "hashed-service-a-c", "event.outcome": "success" },
{ "processor.event": "transaction", "service.name": "d", "transaction.upstream.hash": "hashed-service-a-c" },
{ "processor.event": "span", "service.name": "b", "span.destination.service.resource": "postgres:3004", "span.destination.hash": "hashed-service-a-b", "event.outcome": "failure" },
{ "processor.event": "span", "service.name": "d", "span.destination.service.resource": "postgres:3004", "span.destination.hash": "hashed-service-a-c-d", "event.outcome": "success" }
]
To get the global service map:
- A composite aggregation on transaction.upstream.hash, span.destination.hash, span.destination.service.resource and service.name returns the following data (a sketch of the aggregation request itself follows the output):
[
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : null,
"service.name" : "a",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a",
"service.name" : "b",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a",
"service.name" : "c",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a-b",
"service.name" : "d",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a-c",
"service.name" : "d",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a",
"transaction.upstream.hash" : null,
"service.name" : "a",
"span.destination.service.resource" : "service-b:3000"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a",
"transaction.upstream.hash" : null,
"service.name" : "a",
"span.destination.service.resource" : "service-c:3001"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a-b",
"transaction.upstream.hash" : null,
"service.name" : "b",
"span.destination.service.resource" : "postgres:3004"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a-b",
"transaction.upstream.hash" : null,
"service.name" : "b",
"span.destination.service.resource" : "proxy:3002"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a-c",
"transaction.upstream.hash" : null,
"service.name" : "c",
"span.destination.service.resource" : "service-d:3003"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a-c-d",
"transaction.upstream.hash" : null,
"service.name" : "d",
"span.destination.service.resource" : "postgres:3004"
},
"doc_count" : 1
}
]
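For reference, the request side of that aggregation could look roughly like this (a sketch; the aggregation name and size are arbitrary, and missing_bucket: true is what produces the null keys above):

// Sketch of the composite aggregation request, using the field names proposed above.
const serviceMapConnectionsAgg = {
  connections: {
    composite: {
      size: 10000, // page through with after_key for larger result sets
      sources: [
        { 'span.destination.hash': { terms: { field: 'span.destination.hash', missing_bucket: true } } },
        { 'transaction.upstream.hash': { terms: { field: 'transaction.upstream.hash', missing_bucket: true } } },
        { 'service.name': { terms: { field: 'service.name' } } },
        { 'span.destination.service.resource': { terms: { field: 'span.destination.service.resource', missing_bucket: true } } },
      ],
    },
  },
};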
We can then construct the paths by mapping transaction.upstream.hash to span.destination.hash, which will give us connections and paths between services. There are also requests to external services - these leaf nodes can be found by looking for values of span.destination.hash that don't have a corresponding bucket with the same value for transaction.upstream.hash.
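A rough sketch of that construction (assuming the bucket shape above; not an actual implementation):

interface BucketKey {
  'service.name': string;
  'transaction.upstream.hash': string | null;
  'span.destination.hash': string | null;
  'span.destination.service.resource': string | null;
}

function buildConnections(buckets: Array<{ key: BucketKey }>) {
  // All services that received a given propagated hash via tracestate.
  const servicesByUpstreamHash = new Map<string, string[]>();
  for (const { key } of buckets) {
    const upstreamHash = key['transaction.upstream.hash'];
    if (upstreamHash) {
      const services = servicesByUpstreamHash.get(upstreamHash) ?? [];
      services.push(key['service.name']);
      servicesByUpstreamHash.set(upstreamHash, services);
    }
  }

  const edges: Array<{ from: string; to: string; external: boolean }> = [];
  for (const { key } of buckets) {
    const destinationHash = key['span.destination.hash'];
    if (!destinationHash) continue; // transaction bucket, not an outgoing edge
    const downstreamServices = servicesByUpstreamHash.get(destinationHash);
    if (downstreamServices) {
      // Note: with a path-only hash, a fan-out (a calling both b and c) or an
      // exit to postgres sharing the hash of an exit to an instrumented service
      // maps one destination hash to multiple/wrong downstream services;
      // including the destination in the hash (discussed further down) removes
      // this ambiguity.
      for (const to of downstreamServices) {
        edges.push({ from: key['service.name'], to, external: false });
      }
    } else {
      // No transaction picked up this hash: treat the destination as an
      // external dependency (leaf node) identified by its resource.
      edges.push({
        from: key['service.name'],
        to: key['span.destination.service.resource']!,
        external: true,
      });
    }
  }
  return edges;
}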
For dependency metrics (e.g. request rate from service A to service B, or from service A to postgres), we should filter the documents on service.name to get the metrics, then make an additional request to map the values of span.destination.hash and span.destination.service.resource to either a service name (via transaction.upstream.hash) or - if it's not found - to an external dependency.
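A sketch of that second, resolving request (field names as above; the collected hashes, aggregation names and sizes are illustrative assumptions):

// Hashes collected from the first, service-filtered request (illustrative values).
const collectedDestinationHashes = ['hashed-service-a', 'hashed-service-a-b'];

// Resolve each destination hash to the service name(s) that received it;
// hashes with no match are treated as external dependencies.
const resolveDownstreamServices = {
  size: 0,
  query: {
    bool: {
      filter: [{ terms: { 'transaction.upstream.hash': collectedDestinationHashes } }],
    },
  },
  aggs: {
    by_upstream_hash: {
      terms: { field: 'transaction.upstream.hash', size: 1000 },
      aggs: {
        service: { terms: { field: 'service.name', size: 10 } },
      },
    },
  },
};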
We should also look into how this affects the cardinality of transaction/span metrics.
@dgieselaar nice!
Say we replaced D with two identical services D1 and D2, and say the proxy load-balances across them. In that case we would have a one-to-many relation from upstream hashed-service-a-b to service names "d1" and "d2". What would we do about span metrics from "b" then? Just show them for the edge between "b" and the proxy, but not from the proxy downstream?
Tagging @AlexanderWert who has experience building a metrics-based service map.
@axw what are two identical services D1 and D2 that are interchangeable (load-balanced)? Shouldn't this be considered a wrong setup where user should be advised to set the same service name "D" for both and rely on service.node.name to distinguish between them?
Besides, in @dgieselaar's aggregation example, if they do have different service names configured, the combined key will be different, thus the count can be done separately, or did I misunderstand this?
@axw what are two identical services D1 and D2 that are interchangeable (load-balanced)? Shouldn't this be considered a wrong setup where user should be advised to set the same service name "D" for both and rely on service.node.name to distinguish between them?
Sorry, I meant identical in terms of their input/output and interaction with other services, not necessarily the exact same code. They could be two implementations of a service (e.g. you're migrating from a Java to a Go implementation :trollface:), and might have slightly different service.names. Alternatively they could be two instances of exactly the same service, but running in different service.environments (not sure if that should also be included in the hash?)
Besides, in @dgieselaar's aggregation example, if they do have different service names configured, the combined key will be different, thus the count can be done separately, or did I misunderstand this?
- From b's perspective: we know we have a path a -> b -> proxy
- From d's perspective: we know we have a path a -> b -> d (edges may be indirect, i.e. going through a non-instrumented proxy; d doesn't know about proxy)
If we introduce d2:
- From d2's perspective: we know we have a path a -> b -> d2 (d2 doesn't know about proxy)
... and b still doesn't know about either d or d2.
What would we show on the edges proxy -> d and proxy -> d2?
Sorry, I meant identical in terms of their input/output and interaction with other services, not necessarily the exact same code. They could be two implementations of a service ...
Regardless, I believe that any interchangeable nodes (ones that can be load-balanced) should belong to the same service in our terminology and concepts. Any other filtering/aggregation should rely on other data like agent type, environment or node name.
From d2's perspective: we know we have a path a -> b -> d2 (d2 doesn't know about proxy)
I see. Will this be solved if b includes the proxy in the path it sends through the tracestate (meaning - a -> b -> proxy instead of only a -> b)? Alternatively, send the destination in addition to the path hash.
Actually, without this, how would there even be edges proxy -> d and proxy -> d2? Based on what info?
@axw:
Say we replaced D with two identical services D1 and D2, and say the proxy load-balances across them. In that case we would have a one-to-many relation from upstream hashed-service-a-b to service names "d1" and "d2". What would we do about span metrics from "b" then? Just show them for the edge between "b" and the proxy, but not from the proxy downstream?
I didn't intend for the proxy to be shown on the actual service map, my bad. We would ignore it, as we have a match for a span.destination.hash and transaction.upstream.hash, so we would consider it a direct connection between two services.
In this example, I think we could show a split edge from service C to D1/D2, and show the edge metrics once, if that makes sense.
Alternatively they could be two instances of exactly the same service, but running in different service.environments (not sure if that should also be included in the hash?)
Agree that service.environment should be included in the hash, and in the composite aggregation.
@felixbarny thank you for looping me in. I just wanted to drop in a different idea / approach to realize the service map purely on metric data, thus detaching it from the need of collecting 100% of traces / spans, etc. Feels related to this issue.
The concept is quite simple, based on the following:
- As described above, each service would propagate its own service name (or a hash, doesn't matter)
- The called service reads the propagated information and enriches the existing transaction metrics with an "origin" tag.
We would get a set of metrics with the following conceptual structure (here illustrated as a table):

These metrics represent, in their tags (origin-service, service), bilateral dependencies between services, so they can be used to reconstruct a graph / service map with corresponding metric values attached.
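Purely as an illustration, a transaction metric document under this concept might look like the following (field names here are assumptions):

// One transaction metric document per (origin-service, service) pair.
const exampleMetricDoc = {
  'service.name': 'b',           // the called service
  'labels.origin_service': 'a',  // read from the propagated information
  'transaction.duration.sum': 1250000,
  'transaction.count': 120,
};
// The set of (origin_service, service.name) pairs is enough to reconstruct
// the graph, and the attached values give the edge metrics.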
This is just the core idea; if it is of interest I can elaborate more on the details.
With some additional context propagation and tagging of metrics, this approach is quite powerful, and allows for the following (while being highly scalable in terms of data collection and query/data processing):
- filtering of the service map based on arbitrary flow characteristics (business transactions, application units, users, etc.)
- aggregation of the service map on different levels (application level, service level, node/instance level, region, etc.)
- calculation of edge metrics (response times, load)
- handling of calls to external services in a similar manner
@dgieselaar
In this example, I think we could show a split edge from service C to D1/D2, and show the edge metrics once, if that makes sense.
If I understand correctly, we would have something like (apologies, I do not have @felixbarny's ASCII art mastery):
    (edge metrics here, no indication of split)
C ----------------------------------------------------->
                                                        |----> D1
                                                        |----> D2
I think that works well. Seeing as the edge metrics are meant to be from C's perspective, I suppose it makes sense that they're not attributed to a particular service on the edges. We can still look at transaction/node metrics for the split.
How would we know that we should remove the proxy from the graph, and that it's in between C and D? Perhaps like @eyalkoren described above, we include the destination service resource (proxy:...) in the outbound hash, and propagate that?
@AlexanderWert thanks for your input!
I just wanted to drop in a different idea / approach to realize the service map purely on metric data, thus detaching it from the need of collecting 100% of traces / spans, etc. Feels related to this issue.
We don't necessarily have to capture 100% of traces/spans. We have recently started aggregating metrics based on trace events in APM Server, and we scale them based on the configured sampling rate. The metrics are then stored and used for populating charts (currently opt-in, expected to become the default in the future). I think it would make sense to extend these metrics as described above to power the service map.
As described above, each service would propagate its own service name (or a hash, doesn't matter)
I'd just like to clarify one thing here. IIANM, what you illustrated in the table is a point-to-point graph representation. In that model you're right, it doesn't matter if we propagate the service name or a hash of it (disregarding possible privacy concerns). That's certainly an option, and would keep things fairly simple.
What @dgieselaar has described above is instead a path representation of a graph. This will enable the UI to filter the graph down to a subgraph that includes some node(s), and then only show metrics related to the paths through those nodes and not the excluded nodes. I'd be very interested to hear if you have experience with this approach.
@axw
How would we know that we should remove the proxy from the graph, and that it's in between C and D? Perhaps like @eyalkoren described above, we include the destination service resource (proxy:...) in the outbound hash, and propagate that?
It's removed from the graph by virtue of the span on service C being connected to the transaction on service D, via the hash. I'm not sure if we can tell that there is a proxy in between, or a load balancer, or any other non-instrumented services, even if span.destination.service.resource is included in the hash. But maybe I'm missing something?
I will assume "C" in the last comments was meant to be "B", even though the last one is confusing because there is a c -> d connection as well 🙂
It's removed from the graph by virtue of the span on service C being connected to the transaction on service D, via the hash.
If I read this correctly, it means that given these keys:
{
"key" : {
"span.destination.hash" : "hashed-service-a-b",
"transaction.upstream.hash" : null,
"service.name" : "b",
"span.destination.service.resource" : "proxy:3002"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a-b",
"service.name" : "d",
"span.destination.service.resource" : null
},
"doc_count" : 1
}
there is a transaction.upstream.hash matching a span.destination.hash (hashed-service-a-b), which means there is an a -> b -> d path, so the algorithm will ignore the proxy:3002 exit and not treat it as an external service.
However, it looks the same as looking at:
{
"key" : {
"span.destination.hash" : "hashed-service-a-b",
"transaction.upstream.hash" : null,
"service.name" : "b",
"span.destination.service.resource" : "postgres:3004"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a-b",
"service.name" : "d",
"span.destination.service.resource" : null
},
"doc_count" : 1
}
How would you know that postgres:3004 is a real external service and proxy:3002 is a proxy to D?
This is something you will be able to tell by sending the destination (or a hash of it) in addition to the path.
I'm not sure if we can tell that there is a proxy in between, or a load balancer, or any other non-instrumented services, even if span.destination.service.resource is included in the hash. But maybe I'm missing something?
I think you are right, this is not enough by itself to discover a proxy. Maybe this is something we can rely on request headers for - I think that the Host header should reflect the host and port used by the requested URI at the client side, so they should match. We have both, but not sure how reliable that is.
In addition, the X-Forwarded-For header (or the like) can be used to reveal there is some mid tier.
As for load balancing (@axw's example), assuming we do send the destination, this should be easier - if multiple services (transactions) have the same upstream path AND destination, then you have enough info to add a load-balancer node to the map and have metrics for all edges - the edge to the load balancer and each edge from the load balancer to the service.
Ron suggested to do a POC, perhaps we can pivot https://github.com/elastic/kibana/issues/82598 into one? That way we don't need agent support, and we need to calculate paths there anyway. Thoughts?
@eyalkoren:
How would you know that postgres:3004 is a real external service and proxy:3002 is a proxy to D? This is something you will be able to tell by sending the destination (or a hash of it) in addition to the path.
I'm a little confused by postgres here - should that be something like service-d:3004 vs proxy-to-service-d:3004? I guess what we can get from this is that service B is talking to service D via different addresses. But that also might be because there are different instances of service B?
After a quick call with @eyalkoren, I understand what you mean and you are right: the outgoing hash should include the perceived destination. If we don't do that, when service A is talking to service B and postgres via the same hash (hashed-service-a), we would collapse the service A -> postgres connection into the service A -> service B connection.
One more thing to notice- if service B had two nodes behind a load balancer and the user chose to assign each its own unique service name, say - B1 and B2, then adding the destination helps with that as well - once you see that two services get the same upstream path (including the destination, e.g. hashed-service-a-lb:3002), it is enough for you to draw the load balancer node and have accurate metrics for all edges:
              |-----> B1
              |
A ---> LB ----|
              |
              |-----> B2
Ron suggested to do a POC, perhaps we can pivot elastic/kibana#82598 into one? That way we don't need agent support, and we need to calculate paths there anyway. Thoughts?
Sounds like a good idea to me. Perhaps start with a small POC (e.g. using some hand-written data like above) to validate the idea generally, and then expand on that by generating some complex graph data to test the scalability.
To work around the load balancer issue (which is actually happening on dev-next right now, see https://github.com/elastic/kibana/issues/83152#issuecomment-726729162), we could consider having the called service reply with a response header containing its own hash. The calling service would then use this hash when storing span metrics. If the response header is not there, the calling service will hash its own hash + destination.service.resource. This would enable us to correctly map most of the calls. If the call to the load balancer fails, or the response header is not set for some other reason, we could group these metrics together and display them separately.
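A hypothetical sketch of that fallback on the calling side (the header name and the hashing are assumptions, not an agreed spec):

import { createHash } from 'crypto';

function hash(value: string): string {
  return createHash('sha256').update(value).digest('hex').slice(0, 16);
}

// Decide which hash to store on the span metric for an outgoing call.
function destinationHashForSpan(
  ownPathHash: string,
  destinationResource: string, // e.g. "lb:3004"
  responseHeaders: Record<string, string | undefined>
): string {
  // Hypothetical header through which the called service replies with its own hash.
  const downstreamHash = responseHeaders['elastic-apm-path-hash'];
  if (downstreamHash) {
    return downstreamHash; // maps the call directly to the instrumented callee
  }
  // No reply (uninstrumented destination, failure at the gateway, network issue):
  // fall back to own hash + destination, which groups these calls into an
  // "other"/external bucket.
  return hash(`${ownPathHash}:${destinationResource}`);
}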
@dgieselaar response headers are of course an option that opens even more possibilities, however it means an implementation of a new capability by all agents, including the potential added complications (e.g. such related to modifying a response). For a quick POC, why not try out what I suggested in https://github.com/elastic/apm/issues/364#issuecomment-725287027?
@eyalkoren How would we correctly attribute span metrics to either B1 or B2? I thought metrics would be aggregated for A -> LB only.
Let's assume we have these data:
{
"key" : {
"span.destination.hash" : "hashed-service-a-lb:3004",
"transaction.upstream.hash" : null,
"service.name" : "a",
"span.destination.service.resource" : "lb:3004"
},
"doc_count" : 278
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a-lb:3004",
"service.name" : "b1",
"span.destination.service.resource" : null
},
"doc_count" : 215
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a-lb:3004",
"service.name" : "b2",
"span.destination.service.resource" : null
},
"doc_count" : 63
}
Because we append the destination to the path, the fact that two services have the same transaction.upstream.hash implies that they are being load-balanced (there is a service talking to them through the same address). The entry that contains the matching span.destination.hash specifies also the span.destination.service.resource. So, we should be able to tell that service a sent 278 requests to lb:3004, 215 of which were handled by service b1 and 63 by b2. You would use transaction metrics for the lb:3004 -> b1 and lb:3004 -> b2 edges and use the exit span metrics for the a -> lb:3004 edge.
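In code terms, the attribution described here could look roughly like this (a sketch using the bucket counts above):

// Exit-span metrics from "a" give the a -> lb:3004 edge.
const edgeToLoadBalancer = { from: 'a', to: 'lb:3004', requests: 278 };

// Transaction metrics from the services sharing the same upstream hash give
// the edges from the load balancer to each backend.
const edgesFromLoadBalancer = [
  { from: 'lb:3004', to: 'b1', requests: 215 },
  { from: 'lb:3004', to: 'b2', requests: 63 },
];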
Makes sense?
@eyalkoren it does, I was operating under the assumption that we'd use span metrics always. Are there any downsides to mixing those two?
In this case it is actually straightforward to rely on both I think.
In other cases, there may be contradictions, so we will have to decide how we treat those. I agree it will require some thinking. Maybe it even makes sense to put metrics on both sides of edges where relevant 😲
Instead of doing a composite agg on span.destination.hash and transaction.upstream.hash we could possibly use EQL as well. For instance, for joining a span and its resulting transaction, we could use the following EQL query (still using span.destination.service.resource here):
sequence
  [ span where span.destination.service.resource != null ] by span.id
  [ transaction where true ] by parent.id
This will be such an awesome improvement!
Could you please help me verify this against the API Gateway example below?
I've seen it very often in modern environments with our customers.
(diagram: service A -> API Gateway -> services B, C, and the uninstrumented service D)
Assumptions: we instrumented services A, B, and C but not D (for various reasons; maybe it is owned by a vendor or can't be instrumented by our agents due to technology limitations). We don't monitor the API Gateway today, and are unlikely to be able to collect performance information about traces from it. (Although I have heard requests in the past to visualize API gateways and load balancers on the map.)
With the solutions proposed above, would we be able to successfully draw the right map for this configuration?
- Detect and show connections to B and C and corresponding metrics.
- And also determine uninstrumented backend(s) (i.e. D) and show their performance metrics separately.
I believe with the currently discussed approach you would be able to draw B and C with proper metrics, but not D. I don't think we can rely on attributing excess exit counts from A (based on spans) to an "external service", because mismatches between span metrics and transaction metrics will be common.
In order to support that, we may add a span.destination.service.subresource field that will include the path (or part of the path), e.g. /users, /account and /reports, and opt in to using it in such cases.
@dgieselaar do I understand correctly that currently the idea is to do a POC with the discussed approach based on span and transaction documents and to apply that in the future to rely purely on stored metrics?
I believe with the currently discussed approach you would be able to draw B and C with proper metrics, but not D. I don't think we can rely on attributing excess exit counts from A (based on spans) to an "external service", because mismatches between span metrics and transaction metrics will be common.
If the service responds with its own hash, and the calling service uses that hash to store its span metrics, we would not need transaction metrics, and we would have an "other" bucket that D would fall under, but calls that fail at the gateway, or due to network issues, would also fall into that bucket.
@dgieselaar do I understand correctly that currently the idea is to do a POC with the discussed approach based on span and transaction documents and to apply that in the future to rely purely on stored metrics?
Yes, but I'm not sure if we will get to that in the 7.11 timeframe. Might need @sqren or @graphaelli here for some prioritisation. Also, there are a couple of approaches in play, I'm not sure if we decided which one is best. We can investigate some of it in a POC.
but calls that fail at the gateway, or due to network issues, would also fall into that bucket.
Exactly, so in this regard it means adding complication while being left with the same limitation. Anything we can do with existing data is highly preferable and can be POC'd right away. I'm not saying response headers are out of the question, and I recognise they have potential for additional value, but they will delay your POC quite a bit and they will probably delay GA, so if we can have something useful without them, I think it is a good start.
Moreover, one limitation to keep in mind with response headers is that they will not be able to support async communication, like messaging systems. If you think of a message bus used to create requests to multiple services, you can support that through the use of different destination resources/sub-resources (e.g. message queues/topics), but response headers are irrelevant for such use cases.
Yes, but I'm not sure if we will get to that in the 7.11 timeframe. Might need @sqren or @graphaelli here for some prioritisation. Also, there are a couple of approaches in play, I'm not sure if we decided which one is best. We can investigate some of it in a POC.
We are already pretty strapped for time, and since service maps are not among the roadmap goals for 7.11, any bigger improvements will have to wait until 7.12.