quickwit
search nodes unresponsive with expensive aggregation
As reported:
If I do an aggregation over a large time span, the searchers end up being killed by Kubernetes for missing health checks. I'd assume this is because CPU usage is too high and they are mostly unresponsive.
It seems there are two issues:
- Judging from the logs, the aggregation request is sent 50 times.
- The search thread pool takes all CPUs. This may not leave enough resources to answer health checks.
https://github.com/quickwit-oss/quickwit/pull/5304 targets the second issue: it sizes the search thread pool to use all threads except one.
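The idea of keeping one thread out of the search thread pool can be illustrated with a minimal sketch. This is not the code from the PR: `search_pool_size` and `default_search_pool_size` are hypothetical names, and the sketch only assumes that the pool size is derived from the number of threads the OS reports as available.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical helper (not Quickwit's actual code): given the number of
/// available threads, return the size of a CPU-heavy pool that leaves one
/// thread free for lightweight work such as answering health checks.
/// The pool always keeps at least one worker, even on a single-core host.
fn search_pool_size(available: usize) -> usize {
    available.saturating_sub(1).max(1)
}

/// Hypothetical helper: size the pool from what the OS reports, falling
/// back to a single thread if the parallelism cannot be determined.
fn default_search_pool_size() -> usize {
    let available = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    search_pool_size(available)
}
```

On an 8-thread node this yields a pool of 7 workers. On a single-thread node the pool still gets 1 worker, so the reservation cannot help there.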