CPU never reaching 100%
Quoting @kstaken in #1679
The bad news is that query times are 69-72 seconds with no evidence of caching making any difference. Previously the first query would be in the 60 second range but subsequent queries dropped to around 30 seconds. We also tried doubling the cache sizes to confirm we weren't simply churning the cache and it made no difference. It also never uses much beyond 2 CPUs but maybe that makes sense since it's using async concurrency for the split searches. Previously it was using 4 CPUs.
For the caching part: we use an LRU cache, and your use case is the worst-case scenario for an LRU policy. I noticed that last week and just created ticket #1714 to track it.
Once the data is downloaded, the computation is executed in a thread pool. The thread pool is a wrapper around a rayon thread pool, with the same number of threads as the number of CPUs.
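For intuition, here is a minimal sketch (not Quickwit's actual code) of a rayon pool sized to the CPU count, which is the part that should be able to drive all cores to 100% once the split data is local:

```rust
// Minimal sketch, not Quickwit's code: a rayon pool with one thread per logical CPU.
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    // Same sizing rule as described above: number of threads == number of CPUs.
    let num_cpus = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    let pool = ThreadPoolBuilder::new()
        .num_threads(num_cpus)
        .build()
        .expect("failed to build rayon thread pool");

    // CPU-bound work (per-split search, in Quickwit's case) runs inside the pool
    // and should saturate all cores as long as enough work is queued.
    let sum: u64 = pool.install(|| (0..10_000_000u64).into_par_iter().sum());
    println!("sum = {sum}");
}
```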
Here are possible reasons:
~~a) the RAYON_NUM_THREADS env variable was set to 2 for another reason (unlikely)~~
~~b) minio is the bottleneck (unlikely)~~
c) the number of parallel split searches is not high enough to get the best throughput out of minio (unlikely)
d) the network is the bottleneck (unlikely)
e) a straight bug
c) and d) are the most likely.
I'm adding a bunch of counters that should help identify the problem rapidly.
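For illustration only (the counter names below are made up, not the actual Quickwit metrics), this is the kind of prometheus-style counter I mean:

```rust
use prometheus::{register_int_counter, Encoder, TextEncoder};

fn main() {
    // Illustrative counter names; the real Quickwit metrics differ.
    let splits_searched =
        register_int_counter!("leaf_search_splits_total", "Splits processed by leaf search").unwrap();
    let bytes_fetched =
        register_int_counter!("split_bytes_fetched_total", "Bytes fetched from object storage").unwrap();

    // The relevant code paths would bump these counters during a search.
    splits_searched.inc();
    bytes_fetched.inc_by(8 * 1024 * 1024);

    // Export in the Prometheus text format so the values can be scraped and graphed.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&prometheus::gather(), &mut buf).unwrap();
    println!("{}", String::from_utf8(buf).unwrap());
}
```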
If you want to test hypothesis c), the idea is just to increase max_num_concurrent_split_searches.
It will increase RAM usage and download more split data concurrently.
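To make concrete what that setting bounds, here is a rough sketch (illustrative, not Quickwit's implementation) of a concurrency limit on split searches modeled with a tokio semaphore:

```rust
// Requires tokio = { version = "1", features = ["full"] }.
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn search_split(split_id: usize) {
    // Stand-in for "download the split data from object storage, then search it".
    tokio::time::sleep(std::time::Duration::from_millis(200)).await;
    println!("split {split_id} done");
}

#[tokio::main]
async fn main() {
    let max_concurrent_split_searches = 128; // the knob being tuned in this thread
    let permits = Arc::new(Semaphore::new(max_concurrent_split_searches));

    let mut handles = Vec::new();
    for split_id in 0..1_000 {
        let permits = Arc::clone(&permits);
        handles.push(tokio::spawn(async move {
            // Each in-flight split search holds a permit; raising the limit lets more
            // splits download concurrently, at the cost of more RAM.
            let _permit = permits.acquire().await.unwrap();
            search_split(split_id).await;
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}
```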
We've experimented with increasing max_num_concurrent_split_searches with no obvious change in utilization. We tried 64 and 128.
Last week we had also been experimenting with RAYON_NUM_THREADS and it seemed to make no difference either. The default for that in our env should be 8 and we tested increasing it to 24 with no impact.
The last test we ran was with max_num_concurrent_split_searches=128 and RAYON_NUM_THREADS=24 and again no real change in utilization.
Network is seeing lower utilization than previously. Last week, before the memory fix, we were seeing QW using 4+ CPUs and network utilization of roughly 1.8-1.9 GB/s. Currently, at just over 2 CPUs of utilization, we are only seeing around 930-960 MB/s. The test node has 2 x 10 Gbps bonded NICs (about 2.5 GB/s raw), so the number prior to the memory fix is pretty close to where we would expect the network to max out.
For Minio, we have 32 nodes provisioned, and when we scale QW searchers we see the expected throughput increase as concurrency increases. Nothing in the Minio metrics looks troubling, so I don't think Minio is a source of concern.
I think we can consider a) and b) out.
c) and d) are unlikely, but I cannot rule them out straight away without seeing high-resolution throughput figures (as in, over a rather short window of 3s or so).
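For reference, one way to get throughput figures over such a short window on Linux is to sample /proc/net/dev directly. A rough sketch (the bond0 interface name is an assumption):

```rust
// Hedged sketch: sample /proc/net/dev every few seconds and print the receive
// throughput over each window. Linux-only; the interface name is an assumption.
use std::{fs, thread, time::Duration};

fn rx_bytes(iface: &str) -> u64 {
    let stats = fs::read_to_string("/proc/net/dev").expect("read /proc/net/dev");
    for line in stats.lines() {
        let line = line.trim_start();
        if let Some(rest) = line.strip_prefix(&format!("{iface}:")) {
            // The first column after the interface name is received bytes.
            return rest.split_whitespace().next().unwrap().parse().unwrap();
        }
    }
    panic!("interface {iface} not found");
}

fn main() {
    let iface = "bond0"; // assumption: name of the bonded interface
    let window = Duration::from_secs(3);
    let mut prev = rx_bytes(iface);
    for _ in 0..20 {
        thread::sleep(window);
        let cur = rx_bytes(iface);
        let mbps = (cur - prev) as f64 / window.as_secs_f64() / 1e6;
        println!("rx throughput: {mbps:.0} MB/s over the last 3s");
        prev = cur;
    }
}
```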