Add index routing based on trace.id
Opening this up for discussion:
Should we start routing documents based on trace.id?
The thinking here is that we can always rely on a full trace being available in a single shard.
This potentially has the following benefits:
- Can use search time routing to get the data for trace waterfall
- We can reduce trace paths during the reduce phase of a scripted metric aggregation.
- Potentially faster joins with ESQL?
- Faster sequencing with EQL?
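To make the first point concrete, here is a minimal sketch of a search request that uses search-time routing for a trace waterfall. It relies on the standard `routing` query parameter of the Elasticsearch search API; the helper name `build_trace_search`, the `traces-apm*` index pattern, the sort, and the page size are assumptions for illustration, not what Kibana sends today.

```python
import json
from urllib.parse import quote, urlencode


def build_trace_search(index_pattern: str, trace_id: str) -> tuple[str, str]:
    """Return (path, body) for a search routed by trace.id (hypothetical helper)."""
    # `routing` restricts the search to the shard(s) the value maps to.
    query_string = urlencode({"routing": trace_id})
    path = f"/{quote(index_pattern, safe='*,-')}/_search?{query_string}"
    body = {
        # The term filter is still needed: many traces share one shard.
        "query": {"term": {"trace.id": trace_id}},
        "sort": [{"@timestamp": "asc"}],
        "size": 1000,  # assumed page size
    }
    return path, json.dumps(body)


path, body = build_trace_search("traces-apm*", "0af7651916cd43dd8448eb211c80319c")
print(path)
```

This only narrows the search correctly if every event of the trace was indexed with the same routing value.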
Unknowns:
- Data streams require an explicit opt-in to enable routing; what are the consequences of opting in?
@dgieselaar also raised this a couple of months ago (on Slack). There are a couple of problems: CCS, and service-specific (or some other kind of partitioned) trace data streams. If either of those is present in a system, we cannot assume all trace events for a given trace.id are in the same shard.
I suppose it might still be useful for limiting which shards are searched. One related concern I have is that this could cause shard hot spotting, either due to bugs/weirdness (e.g. https://github.com/elastic/apm-server/issues/3922) or through malicious intent.
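The hot spotting concern can be illustrated with a rough simulation. This is a sketch only: the hash below is CRC32 as a dependency-free stand-in (Elasticsearch uses Murmur3 with a routing factor), and the trace sizes and shard count are made-up numbers.

```python
import zlib
from collections import Counter


def shard_for(routing_value: str, num_shards: int) -> int:
    # Stand-in hash, NOT Elasticsearch's actual routing function.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def docs_per_shard(trace_sizes: dict[str, int], num_shards: int) -> Counter:
    """Count documents per shard when every event is routed by its trace.id."""
    counts = Counter({shard: 0 for shard in range(num_shards)})
    for trace_id, num_events in trace_sizes.items():
        counts[shard_for(trace_id, num_shards)] += num_events
    return counts


# 1000 ordinary traces with 20 events each, plus one runaway trace
# (e.g. caused by a bug or sent deliberately).
traces = {f"trace-{i:04d}": 20 for i in range(1000)}
traces["runaway-trace"] = 100_000

counts = docs_per_shard(traces, num_shards=4)
hot_shard = shard_for("runaway-trace", 4)
print(counts, "hot shard:", hot_shard)
```

Without custom routing the runaway trace's events would be spread across shards by document ID; with routing by trace.id they all land on one shard.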
Looking into this a bit, we would need to
- allow custom routing in the index template via datastream.allow_custom_routing: true. This requires an update to the package spec to allow the additional value. @kpollich would this also require a change in Fleet?
- add a user facing config option to disable the functionality, e.g. for CCS
- adapt bulk indexer logic to add the trace.id of an event to the request, if the functionality is enabled
- update docs
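A rough sketch of the bulk indexer change, written in Python for brevity rather than as apm-server code. It assumes the bulk API's `routing` field in the action metadata and the `create` action used for data streams; the function name, the `route_by_trace_id` flag, and the nested event shape are hypothetical.

```python
import json
from typing import Iterable


def build_bulk_body(events: Iterable[dict], index: str, route_by_trace_id: bool) -> str:
    """Build an NDJSON bulk body, optionally routing each event by trace.id."""
    lines = []
    for event in events:
        action = {"_index": index}
        trace_id = event.get("trace", {}).get("id")
        # Only add routing when the (hypothetical) option is enabled and the
        # event actually carries a trace.id.
        if route_by_trace_id and trace_id:
            action["routing"] = trace_id
        lines.append(json.dumps({"create": action}))
        lines.append(json.dumps(event))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)


events = [
    {"trace": {"id": "abc"}, "span": {"id": "s1"}},
    {"metricset": {"name": "app"}},  # no trace.id, so no routing value
]
print(build_bulk_body(events, "traces-apm-default", route_by_trace_id=True))
```

Events without a trace.id fall back to default routing here; whether that is the right behaviour is one of the things to decide.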