
WIP: [erchef] Limit the number of inflight ES updates

Open stevendanna opened this issue 5 years ago • 2 comments

This limits the total number of inflight requests to update the search index. The goal is to prevent the unbounded growth we can currently see as a result of the unlimited message queue on the chef_index_batch process and the large number of requests that can be stuck waiting on ES.
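
For concreteness, here is a minimal sketch of the kind of ticket counter this describes, assuming an ETS-backed checkout/return counter; the module, table, and function names are illustrative rather than the code in this PR:

```erlang
%% Illustrative sketch only: an ETS-backed ticket counter for limiting
%% inflight ES update requests. Not the code in this PR.
-module(index_tickets).
-export([init/1, checkout/0, return/0]).

-define(TABLE, index_tickets).
-define(KEY, available).

%% Create the public counter table with Max tickets available. In the PR
%% the table is owned by the batcher; ownership is glossed over here.
init(Max) when is_integer(Max), Max > 0 ->
    ?TABLE = ets:new(?TABLE, [named_table, public, set]),
    true = ets:insert(?TABLE, {?KEY, Max}),
    ok.

%% Try to take a ticket. Refuse (rather than block) when none are left, so
%% callers can shed load instead of queueing without bound.
checkout() ->
    case ets:update_counter(?TABLE, ?KEY, -1) of
        N when N >= 0 ->
            ok;
        _Exhausted ->
            %% We went below zero: no ticket was available. Undo our
            %% decrement and tell the caller to back off.
            ets:update_counter(?TABLE, ?KEY, 1),
            {error, no_tickets}
    end.

%% Hand the ticket back once the ES request has completed (or failed).
return() ->
    ets:update_counter(?TABLE, ?KEY, 1),
    ok.
```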

Signed-off-by: Steven Danna [email protected]

stevendanna avatar May 16 '20 00:05 stevendanna

> I do have one concern: I think if an error happens when a ticket is checked out we would 'leak' the ticket. A loaded system might well be at higher risk of failures, so that could compound itself.

This is the main reason it is still WIP. I think in the current use case, the main place we would see this is if the flusher process exits for some reason. If the batcher exits, then the ets table will also disappear because the batcher owns it. I think if we were to trap exits we could then take care of failures in the flusher process.
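
Roughly the shape I have in mind, as a sketch only: trap exits in the batcher and hand the ticket back when a linked flusher dies before returning it. This reuses the hypothetical index_tickets helper sketched in the description and glosses over the real batching logic:

```erlang
%% Sketch only, not the erchef code: a batcher that traps exits so a flusher
%% crashing mid-request still hands its ticket back.
-module(batcher_sketch).
-behaviour(gen_server).
-export([start_link/0, flush/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

flush() ->
    gen_server:cast(?MODULE, flush).

init([]) ->
    %% Trap exits so a linked flusher crash arrives as a message rather
    %% than taking down the batcher (and the ETS table it owns).
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call(_Msg, _From, State) ->
    {reply, ok, State}.

handle_cast(flush, State) ->
    case index_tickets:checkout() of
        ok ->
            %% The linked flusher holds one ticket for the lifetime of its
            %% ES request.
            Pid = spawn_link(fun flush_batch/0),
            {noreply, State#{Pid => ticket}};
        {error, no_tickets} ->
            %% Too many inflight updates already; shed or defer this batch.
            {noreply, State}
    end.

handle_info({'EXIT', Pid, _Reason}, State) ->
    case maps:take(Pid, State) of
        {ticket, Rest} ->
            %% The flusher finished or died; either way, reclaim its ticket
            %% so failures under load cannot slowly drain the pool.
            ok = index_tickets:return(),
            {noreply, Rest};
        error ->
            {noreply, State}
    end;
handle_info(_Msg, State) ->
    {noreply, State}.

%% Placeholder for the real work: POST the accumulated batch to ES.
flush_batch() ->
    ok.
```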

> Have you run tests against this? What's here is certainly enough to test the basic functionality of the concept under load.

I have, but I haven't had enough time to be confident in the results. Here is my current thinking on what the "best" path for us might be:

  1. Move to "inline" by default. This means that each request will generate a POST to ES, but in my experiments, all but the most heavily loaded servers likely operate in this mode anyway.
  2. Have separate pools for ES index requests vs ES search requests. That gives us a knob to naturally limit the number of inflight ES index requests, and we produce back pressure in the same way we do with postgresql.
  3. Use a mechanism similar to this one, but one that allows waiting rather than returning immediately, and wrap it around document expansion (a rough sketch of what I mean follows this list). What I've seen in testing is that right now we actually allow an unbounded number of processes to be expanding at the same time. Expansion is almost entirely CPU/memory bound, so pushing parallelism too high just ends up thrashing the machine. I have an implementation locally using a simple gen_server. I like the idea of using a dummy pooler pool: I originally wanted to avoid pooler because I didn't want to introduce another process that had to share the binary data, but a dummy pool may be a nice way to avoid yet another ad-hoc pool/lock mechanism.
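
To make item 3 concrete, this is the shape of the simple gen_server I mean: a blocking limiter that callers wrap around expansion so at most N expansions run at once. All names here are hypothetical, and it skips details such as monitoring parked callers so a crashed waiter cannot strand a slot:

```erlang
%% Hypothetical sketch of a blocking concurrency limiter for document
%% expansion: callers wait for a slot instead of being turned away, so at
%% most Max expansions run at once. Not the erchef implementation.
-module(expand_limiter).
-behaviour(gen_server).
-export([start_link/1, with_slot/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(Max) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Max, []).

%% Run Fun while holding a slot, blocking until one is free. The slot is
%% released even if Fun throws.
with_slot(Fun) ->
    ok = gen_server:call(?MODULE, acquire, infinity),
    try
        Fun()
    after
        gen_server:cast(?MODULE, release)
    end.

init(Max) ->
    {ok, #{free => Max, waiting => queue:new()}}.

handle_call(acquire, From, #{free := Free, waiting := Waiting} = State) ->
    case Free of
        0 ->
            %% No slot free: park the caller instead of replying, producing
            %% back pressure the same way a checked-out db connection does.
            {noreply, State#{waiting := queue:in(From, Waiting)}};
        N ->
            {reply, ok, State#{free := N - 1}}
    end.

handle_cast(release, #{free := Free, waiting := Waiting} = State) ->
    case queue:out(Waiting) of
        {{value, From}, Rest} ->
            %% Hand the freed slot straight to the oldest waiter.
            gen_server:reply(From, ok),
            {noreply, State#{waiting := Rest}};
        {empty, _} ->
            {noreply, State#{free := Free + 1}}
    end.

handle_info(_Msg, State) ->
    {noreply, State}.
```

A caller would then wrap the expansion step in something like `expand_limiter:with_slot(fun() -> expand(Doc) end)`, with `expand/1` standing in for the real expansion call.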

I'm not sure how far I'll be able to get on this, but would love your thoughts on the general plan.

I hope to get back to this once more of the pipeline is green.

stevendanna avatar May 21 '20 18:05 stevendanna

@PrajaktaPurohit Is this work we want to continue?

tas50 avatar Mar 02 '21 16:03 tas50