apm-server Add index routing based on trace.id

Opening this up for discussion:

Should we start routing documents based on trace.id? The idea is that we could then rely on a full trace being available in a single shard.

This potentially has the following benefits:

We can use search-time routing to fetch the data for the trace waterfall.
We can reduce trace paths during the reduce phase of a scripted metric aggregation.
Potentially faster joins with ESQL?
Faster sequencing with EQL?
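
To make the first benefit concrete, here is a minimal sketch of a routed search for a trace waterfall. It only builds the request; the `traces-apm*` index pattern, the `size` default, and the helper name `trace_waterfall_search` are assumptions for illustration, not apm-server or Kibana code. The `routing` query-string parameter is part of the Elasticsearch search API.

```python
import json
from urllib.parse import urlencode


def trace_waterfall_search(trace_id, index_pattern="traces-apm*", size=1000):
    """Return (path, body) for a search routed by trace.id (hypothetical helper)."""
    # `routing` limits the search to the shard(s) the value maps to.
    path = f"/{index_pattern}/_search?" + urlencode({"routing": trace_id})
    body = {
        "size": size,
        # The term filter is still required: other traces share the same shard.
        "query": {"term": {"trace.id": trace_id}},
    }
    return path, json.dumps(body)


path, body = trace_waterfall_search("abc123")
```

Routing narrows which shards are queried but does not replace the filter on trace.id, since many traces hash to the same shard.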

Unknowns:

Data streams require an explicit opt-in to enable routing. What are the consequences of opting in?

Aug 04 '22 11:08 Mpdreamz

@dgieselaar also raised this a couple of months ago (on Slack). There are a couple of problems: CCS, and service-specific (or otherwise partitioned) trace data streams. If either of those is present in a system, we cannot assume that all trace events for a given trace.id are in the same shard.

I suppose it might still be useful for limiting which shards are searched. One related concern I have is that this could cause shard hot spotting, either due to bugs/weirdness (e.g. https://github.com/elastic/apm-server/issues/3922) or through malicious intent.
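
A rough sketch of the hot-spotting concern, assuming a simplified routing function: the MD5-based hash below is only a stand-in (Elasticsearch uses its own Murmur3-based routing), and the event counts are made up for illustration. The point is that every event of one trace lands on one shard, so a single runaway or malicious trace.id skews the load.

```python
import hashlib
from collections import Counter


def shard_for(routing_key, num_shards):
    # Stand-in hash for illustration; not Elasticsearch's actual routing function.
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def shard_load(trace_ids, num_shards):
    """Count events per shard when each event is routed by its trace.id."""
    counts = Counter({shard: 0 for shard in range(num_shards)})
    for trace_id in trace_ids:
        counts[shard_for(trace_id, num_shards)] += 1
    return counts


# 1,000 ordinary traces with 10 events each, plus one runaway trace (invented numbers).
events = [f"trace-{i}" for i in range(1000) for _ in range(10)]
events += ["runaway-trace"] * 50_000
load = shard_load(events, num_shards=4)
```

Here the shard that owns `runaway-trace` receives at least 50,000 of the 60,000 events, regardless of how evenly the ordinary traces spread.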

Aug 08 '22 09:08 axw

Looking into this a bit, we would need to:

allow custom routing in the index template via data_stream.allow_custom_routing: true. This requires an update to the package spec to allow the additional value. @kpollich, would this also require a change in Fleet?
add a user facing config option to disable the functionality, e.g. for CCS
adapt bulk indexer logic to add the trace.id of an event to the request, if the functionality is enabled
update docs
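
The bulk indexer step above could look roughly like the following sketch. It is not apm-server code: the function name `bulk_lines`, the `route_by_trace_id` flag (standing in for the user-facing config option), and the nested `trace.id` event shape are assumptions. The `routing` field in the bulk action metadata is part of the Elasticsearch bulk API.

```python
import json


def bulk_lines(events, route_by_trace_id=True):
    """Build an NDJSON bulk body, optionally routing each event by trace.id."""
    lines = []
    for event in events:
        action = {}
        trace_id = event.get("trace", {}).get("id")
        # Only add routing when enabled and the event carries a trace.id;
        # events without one (e.g. some metrics) fall back to default routing.
        if route_by_trace_id and trace_id:
            action["routing"] = trace_id
        # Data streams are append-only, so the `create` op is used.
        lines.append(json.dumps({"create": action}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"


body = bulk_lines([
    {"trace": {"id": "abc"}, "transaction": {"name": "GET /"}},
    {"metricset": {"name": "app"}},
])
```

Setting `route_by_trace_id=False` produces plain `create` actions, which is the behaviour that would be wanted for CCS or partitioned data streams.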

Jan 19 '23 17:01 simitt