quickwit
quickwit copied to clipboard
Indexing performance drop
We apparently have a performance regression between
- https://github.com/quickwit-oss/quickwit/commit/1b0c87655a927034a06c2743564e9bf1d6ee2da1 (excluded)
- https://github.com/quickwit-oss/quickwit/commit/ab81a518862473dc3da6462128b197a346bac4bc (included)
as spotted by @fmassot
Tantivy went from -tantivy = { git = "https://github.com/quickwit-oss/tantivy/", rev = "6181c1e", default-features = false, features = [ +tantivy = { git = "https://github.com/quickwit-oss/tantivy/", rev = "92b5526", default-features = false, features = [
Note that the CPU time did not change:
So the regression is probably due to some reduction in computation parallelism, and not the computations themselves taking more CPU time.
Not necessarily tantivy then
The drop is caused by this PR: Fix the ingest rate displayed in the CLI https://github.com/quickwit-oss/quickwit/pull/4682
https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1573,1574&search_metric=engine_duration
(the name for 894188f19 should be before_fix_ingest_rate)
:) I didn't see that coming :D
@PSeitz This PR seems very safe BUT... it actually DOES something on the server side of ingest v1.
It introduced a change (hopefully a bugfix) in the code of our rate estimator.
The rate estimator itself is used in a strange RateModulator.
The ideas is this: Quickwit need one way or another to have some backpressure mechanism. It was judged at the time that sending back 419 status code could be a problem from clients.
To avoid it (it does not really avoid it but well), when quickwit sees the queue memory getting close to its limit, it will smoothly time::sleep on the server side before returning a 200.
If the memory limit is reached we do return 419 however.
We can try and remove this logic, and see if it works. (the quickwit client, retries after 500ms upon receiving a 419 anyway). If this fixes the bug, then we can dig deeper and see if:
- we still have a bug on the rate estimator
- if this is a side effect of the rate modulator kicking in more agressively when it has the right rate figures
- we still want this rate modulator or if we want to fix the code etc.
Can you rerun the bench without the rate modulation layer?
You will find it in the start_ingest_client_if_needed function.
Disabling the rate modulator fixes the performance issue: https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1573,1581&search_metric=engine_duration
The first call to get the memory use in the rate modulator is returning 5 (5x the max capacity), which then causes the ingestion to be slower. I tested to set memory_usage_ratio to a fixed 0.1, but it's still slightly slower than before.
https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1573,1581,1584&search_metric=engine_duration
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 5.448237061500549e-8
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.004656488075852394