quickwit icon indicating copy to clipboard operation
quickwit copied to clipboard

Indexing performance drop

Open fulmicoton opened this issue 1 year ago • 7 comments

We apparently have a performance regression between

  • https://github.com/quickwit-oss/quickwit/commit/1b0c87655a927034a06c2743564e9bf1d6ee2da1 (excluded)
  • https://github.com/quickwit-oss/quickwit/commit/ab81a518862473dc3da6462128b197a346bac4bc (included) image

as spotted by @fmassot

fulmicoton avatar May 24 '24 06:05 fulmicoton

Tantivy went from -tantivy = { git = "https://github.com/quickwit-oss/tantivy/", rev = "6181c1e", default-features = false, features = [ +tantivy = { git = "https://github.com/quickwit-oss/tantivy/", rev = "92b5526", default-features = false, features = [

fulmicoton avatar May 24 '24 06:05 fulmicoton

Note that the CPU time did not change: Screenshot from 2024-05-24 10-58-05 So the regression is probably due to some reduction in computation parallelism, and not the computations themselves taking more CPU time.

RaphaelMarinier avatar May 24 '24 09:05 RaphaelMarinier

Not necessarily tantivy then

fulmicoton avatar May 24 '24 09:05 fulmicoton

The drop is caused by this PR: Fix the ingest rate displayed in the CLI https://github.com/quickwit-oss/quickwit/pull/4682

https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1573,1574&search_metric=engine_duration (the name for 894188f19 should be before_fix_ingest_rate)

PSeitz avatar May 27 '24 03:05 PSeitz

:) I didn't see that coming :D

fulmicoton avatar May 27 '24 04:05 fulmicoton

@PSeitz This PR seems very safe BUT... it actually DOES something on the server side of ingest v1.

It introduced a change (hopefully a bugfix) in the code of our rate estimator. The rate estimator itself is used in a strange RateModulator.

The ideas is this: Quickwit need one way or another to have some backpressure mechanism. It was judged at the time that sending back 419 status code could be a problem from clients.

To avoid it (it does not really avoid it but well), when quickwit sees the queue memory getting close to its limit, it will smoothly time::sleep on the server side before returning a 200. If the memory limit is reached we do return 419 however.

We can try and remove this logic, and see if it works. (the quickwit client, retries after 500ms upon receiving a 419 anyway). If this fixes the bug, then we can dig deeper and see if:

  • we still have a bug on the rate estimator
  • if this is a side effect of the rate modulator kicking in more agressively when it has the right rate figures
  • we still want this rate modulator or if we want to fix the code etc.

Can you rerun the bench without the rate modulation layer? You will find it in the start_ingest_client_if_needed function.

fulmicoton avatar May 27 '24 04:05 fulmicoton

Disabling the rate modulator fixes the performance issue: https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1573,1581&search_metric=engine_duration

The first call to get the memory use in the rate modulator is returning 5 (5x the max capacity), which then causes the ingestion to be slower. I tested to set memory_usage_ratio to a fixed 0.1, but it's still slightly slower than before. https://qw-benchmarks.104.155.161.122.nip.io/?run_ids=1573,1581,1584&search_metric=engine_duration

[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 5.448237061500549e-8
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.0023282133042812347
[quickwit-serve/src/rate_modulator.rs:63:9] memory_usage_ratio = 0.004656488075852394

PSeitz avatar May 27 '24 09:05 PSeitz