[BUG] High latency when fanning out large requests to many shards.
Describe the bug
We're issuing a search request whose body is a ~2MB JSON (lots of filters) against a 1000-shard, 25-node OpenSearch index. The index holds 2TB at the time of this issue, but we're still indexing and will eventually reach 30TB, so reducing the shard count isn't an option. Single-shard execution is relatively quick (~500ms), but we observe ~6s latency when issuing the query against all shards. We've set pre_filter_shard_size and max_concurrent_shard_requests to 999999 so that the fan-out is as wide as possible and all shards execute in parallel. We think we've narrowed the source of the high end-to-end latency down to request serialization.
IIUC, request serialization occurs synchronously for every shard the request is fanned out to. Given that much of the work is in serializing the search source, and the source shouldn't vary between shards, does it make sense to add a caching layer to avoid duplicating this work? If so, we'd be happy to contribute.
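For reference, a minimal sketch of how the request is constructed, using the transport-level SearchRequest API; the index name, field name, and term list are placeholders, and the two settings are the ones mentioned above:

```java
import java.util.List;

import org.opensearch.action.search.SearchRequest;
import org.opensearch.index.query.QueryBuilders;
import org.opensearch.search.builder.SearchSourceBuilder;

public class FanoutRequestSketch {
    public static SearchRequest build(List<String> largeTermList) {
        // Filter-heavy source; the real request carries many such clauses (~2MB of JSON).
        SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.boolQuery()
                .filter(QueryBuilders.termsQuery("some_field", largeTermList)));

        SearchRequest request = new SearchRequest("my-large-index").source(source);
        // Widest possible fan-out: the huge pre_filter_shard_size effectively disables the
        // can-match pre-filter round, and all shard requests may be dispatched concurrently.
        request.setPreFilterShardSize(999_999);
        request.setMaxConcurrentShardRequests(999_999);
        return request;
    }
}
```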
Related component
Search:Performance
To Reproduce
Create an index with 1000 shards and issue a query against it that ANDs a large number of TermInSet queries with long terms (i.e. large in character count).
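A hedged sketch of a generator for such a source; the field names, clause counts, and term lengths below are made up, so tune them until the rendered JSON reaches a few MB:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.opensearch.index.query.BoolQueryBuilder;
import org.opensearch.index.query.QueryBuilders;
import org.opensearch.search.builder.SearchSourceBuilder;

public class LargeSourceSketch {
    // Builds a bool query that ANDs `clauses` terms queries, each carrying
    // `termsPerClause` random terms of `termLength` characters.
    public static SearchSourceBuilder build(int clauses, int termsPerClause, int termLength) {
        Random random = new Random(42);
        BoolQueryBuilder bool = QueryBuilders.boolQuery();
        for (int i = 0; i < clauses; i++) {
            List<String> terms = new ArrayList<>(termsPerClause);
            for (int j = 0; j < termsPerClause; j++) {
                StringBuilder term = new StringBuilder(termLength);
                for (int k = 0; k < termLength; k++) {
                    term.append((char) ('a' + random.nextInt(26)));
                }
                terms.add(term.toString());
            }
            bool.filter(QueryBuilders.termsQuery("field_" + i, terms));
        }
        return new SearchSourceBuilder().query(bool);
    }
}
```

For example, `build(50, 200, 100)` yields roughly 50 * 200 * 100 = 1,000,000 characters of terms before JSON overhead.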
Expected behavior
We expect all shard requests to be emitted at roughly the same time.
Additional Details
Plugins
gcs-repository, plus a custom plugin with custom query types, rescorers, highlighters, and admission controller action filters.
Screenshots
Plot of shard parallelism based on debug logging we added for start and end times:
[Search Triage] This seems like a good optimization; happy to assist if you need some help improving it.
I guess this is the culprit, and it can serve as a starting point: https://github.com/opensearch-project/OpenSearch/blob/09bacee5fc85676e97bee6b4ad87dec35c6aa8cc/server/src/main/java/org/opensearch/search/internal/ShardSearchRequest.java#L309
If we could memoize this, and ensure that the memoized/serialized request is no longer mutable, that should work.
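A rough sketch of that shape, not the actual ShardSearchRequest code: the class and method names below are made up, the import paths are a guess and differ between 2.x and main, and a real change would additionally have to preserve the existing wire format (including the optional-source framing):

```java
import java.io.IOException;

import org.opensearch.common.io.stream.BytesStreamOutput;   // import path is an assumption; varies by version
import org.opensearch.core.common.bytes.BytesReference;      // import path is an assumption; varies by version
import org.opensearch.core.common.io.stream.StreamOutput;    // import path is an assumption; varies by version
import org.opensearch.search.builder.SearchSourceBuilder;

// Serialize the (now effectively immutable) search source once and reuse the
// bytes for every per-shard copy instead of re-serializing it N times.
final class MemoizedSourceSketch {
    private final SearchSourceBuilder source;
    private volatile BytesReference serializedSource; // filled on first use, then shared

    MemoizedSourceSketch(SearchSourceBuilder source) {
        this.source = source;
    }

    void writeSourceTo(StreamOutput out) throws IOException {
        BytesReference bytes = serializedSource;
        if (bytes == null) {
            try (BytesStreamOutput buffer = new BytesStreamOutput()) {
                source.writeTo(buffer);        // pay the serialization cost once
                bytes = buffer.bytes();
                serializedSource = bytes;      // benign race: worst case two threads serialize
            }
        }
        bytes.writeTo(out);                    // per-shard copies just write the cached bytes
    }
}
```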
We have a fix for this and it's significantly improved tail latency on our 2.19 clusters. Before sending a PR, do you have any guidelines on how to test? We don't run 3.x clusters so it's hard to validate that the PR against main offers the same latency improvement and doesn't break any functionality. Does running the integration test suite on the PR against main suffice?
Friendly follow-up on this ^