
Add index routing based on trace.id

Open Mpdreamz opened this issue 3 years ago • 1 comments

Opening this up for discussion:

Should we start routing documents based on trace.id? The thinking here is that we could then always rely on a full trace being available in a single shard.

This potentially has the following benefits:

  • Can use search time routing to get the data for trace waterfall
  • We can reduce trace paths during the reduce phase of a scripted metric aggregation.
  • Potentially faster joins with ESQL?
  • Faster sequencing with EQL?
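To illustrate the first point, here is a minimal sketch of what search-time routing for a trace waterfall could look like. It only builds the request path and body; the `routing` query parameter is part of the Elasticsearch search API, while the function name `traceSearchRequest`, the data stream name `traces-apm-default`, and the example trace ID are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// traceSearchRequest is a hypothetical helper. It returns the path and JSON
// body of a search that fetches all events of one trace, passing the trace ID
// as the routing value so that only the shard(s) that routing value maps to
// are searched.
func traceSearchRequest(index, traceID string) (string, []byte, error) {
	q := url.Values{}
	q.Set("routing", traceID)

	body := map[string]any{
		"query": map[string]any{
			"term": map[string]any{"trace.id": traceID},
		},
	}
	b, err := json.Marshal(body)
	if err != nil {
		return "", nil, err
	}
	return "/" + index + "/_search?" + q.Encode(), b, nil
}

func main() {
	// Index name and trace ID are illustrative values only.
	path, body, err := traceSearchRequest("traces-apm-default", "0af7651916cd43dd8448eb211c80319c")
	if err != nil {
		panic(err)
	}
	fmt.Println("POST", path)
	fmt.Println(string(body))
}
```

This only pays off if every event of the trace was indexed with the same routing value; the term query is kept so that other traces sharing the shard are filtered out.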

Unknowns:

Mpdreamz avatar Aug 04 '22 11:08 Mpdreamz

@dgieselaar also raised this a couple of months ago (on Slack). There are a couple of problems: CCS, and service-specific (or some other kind of partitioned) trace data streams. If either of those is present in a system, we cannot assume all trace events for a given trace.id are in the same shard.

I suppose it might still be useful for limiting which shards are searched. One related concern I have is that this could cause shard hot spotting, either due to bugs/weirdness (e.g. https://github.com/elastic/apm-server/issues/3922) or through malicious intent.
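The hot spotting concern can be sketched with a toy model. The code below routes each event to a shard by hashing its trace ID and counts events per shard. FNV-1a is used purely as a stand-in hash (Elasticsearch's actual routing hash and shard formula differ), and the names `shardFor` and `shardCounts` are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a routing value to a shard number. FNV-1a is a stand-in;
// this is not the hash function Elasticsearch uses.
func shardFor(routing string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(routing))
	return h.Sum32() % numShards
}

// shardCounts counts how many events land on each shard when every event is
// routed by its trace ID. traceIDs holds one entry per event.
func shardCounts(traceIDs []string, numShards uint32) []int {
	counts := make([]int, numShards)
	for _, id := range traceIDs {
		counts[shardFor(id, numShards)]++
	}
	return counts
}

func main() {
	var events []string
	// One runaway (buggy or malicious) trace producing 1000 events...
	for i := 0; i < 1000; i++ {
		events = append(events, "runaway-trace")
	}
	// ...next to ten well-behaved traces with a single event each.
	for i := 0; i < 10; i++ {
		events = append(events, fmt.Sprintf("trace-%d", i))
	}
	fmt.Println(shardCounts(events, 4))
}
```

Whatever the hash, all 1000 events of the runaway trace end up on one shard, whereas without custom routing they would be spread by document ID.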

axw avatar Aug 08 '22 09:08 axw

Looking into this a bit, we would need to:

  • allow custom routing in the index template via datastream.allow_custom_routing: true. This requires an update to the package spec to allow the additional value. @kpollich would this also require a change in Fleet?
  • add a user facing config option to disable the functionality, e.g. for CCS
  • adapt bulk indexer logic to add the trace.id of an event to the request, if the functionality is enabled
  • update docs
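The bulk indexer change from the list above could look roughly like the following sketch. It builds the action metadata line of a bulk request and sets `routing` to the event's trace.id only when the feature is enabled. The `routing` field in bulk action metadata is part of the Elasticsearch bulk API; the type and function names, the `routeByTraceID` flag, and the data stream name are assumptions, not the actual apm-server implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bulkMeta is the metadata of a single bulk action.
type bulkMeta struct {
	Index   string `json:"_index"`
	Routing string `json:"routing,omitempty"`
}

// bulkAction wraps the metadata in a "create" action, the operation used
// when writing to data streams.
type bulkAction struct {
	Create bulkMeta `json:"create"`
}

// bulkActionLine is a hypothetical helper. routeByTraceID stands for the
// user-facing option that lets operators turn routing off, e.g. for CCS.
// Events without a trace.id are left unrouted.
func bulkActionLine(dataStream, traceID string, routeByTraceID bool) ([]byte, error) {
	meta := bulkMeta{Index: dataStream}
	if routeByTraceID && traceID != "" {
		meta.Routing = traceID
	}
	return json.Marshal(bulkAction{Create: meta})
}

func main() {
	for _, enabled := range []bool{true, false} {
		line, err := bulkActionLine("traces-apm-default", "abc123", enabled)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(line))
	}
}
```

With routing enabled the action line carries `"routing":"abc123"`; with it disabled the field is omitted and Elasticsearch falls back to its default routing.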

simitt avatar Jan 19 '23 17:01 simitt