opensearch-go
[BUG] Extreme latency in BulkIndexer
What is the bug?
With a BulkIndexer configured with 2 workers, I am seeing unexpected latency on BulkIndexer.Add(). The workers do not appear to consume the queue within a reasonable timeframe: I'm seeing delays of over 20s!
For example, in the last hour I have seen 53 cases of >1s latency on Add() alone, out of a total of 174 calls.
How can one reproduce the bug? With 2 workers running, adding items from different goroutines and a relatively busy search cluster.
What is the expected behavior? Sub-millisecond latencies, basically the time it takes to shove something into a channel.
What is your host/environment?
- OS: Ubuntu 20.02
- Version: 1.1.0 (but nothing has changed to the bulkindexer since the fork from ES)
Do you have any screenshots?

Note: returning to the default of numCPU workers seems to alleviate the issue, but given the significant delays (seconds versus sub-millisecond) I would still strongly argue that there is an underlying issue here. At the very least I would suggest documenting this unexpected behaviour.
Please see the difference below:

The issue seems reduced but it is still occurring!

CPU load on this server is around 10% and the load average is around 4. About 10% of Add() calls still take >1s.
@VijayanB @VachaShah Any ideas?
Poke!
@dokterbob This looks visibly problematic, but it doesn't look like the folks here got around to looking into it. Let's try to move this forward. First, what's the easiest way to reproduce this (maybe post code similar to the benchmarks in this project)? Are you able to bulk load data a lot faster into this instance with other mechanisms (i.e. is this a client issue for sure)?
@dokterbob Do you mind posting some of the code you were using to help pinpoint this issue?
Sorry, I didn't see the messages. Code is https://github.com/ipfs-search/ipfs-search/ but of course you'll need a more detailed test case.
After increasing the workers it seems the problem has become less severe. Now that it's been picked up I'll see if I can get more concrete feedback over the next couple of weeks.
@dokterbob Hey, I'd like to work on solving this issue. How relevant is it still? Are there any problems now, or was it on the client side?