elasticsearch-java
Thread lock scenario at BulkIngester // FnCondition with a high-concurrency setup
Java API client version
7.17.12
Java version
11
Elasticsearch Version
7.17.12
Problem description
Hi, I think I have found a bug with the BulkIngester, maybe an issue with the locks.
The problem is that only certain dev machines and some servers show this issue. We run the 7.17.12 Java client library. I cannot fully figure out what is going on, and it probably makes no sense to create a ticket for this without being able to reproduce it properly, but I have attached a thread dump which shows several threads still waiting; I hope this helps.
More context:
We use the bulk ingester to index a file with ~12k documents (just one example file). It runs to 99% and then gets stuck, and because we have configured a 10-second flush interval on the BulkIngester, every 10 seconds we see a bulk context getting flushed with just a single document in it. This goes on for 3 to 4 minutes, and every 10 seconds it is the same picture: one bulk request with a single add operation. A thread dump shows that some threads are waiting in BulkIngester.add, which is blocked inside FnCondition.whenReadyIf(...) at the awaitUninterruptibly call. So it seems one bulk request comes back with a single operation in it, which triggers the addCondition.signalIfReady() call and lets the next request through, but again with just a single operation in it. This does not happen when debugging, and it does not happen when adding a per-document log message, which is why I think it is a race condition somewhere. If I change addCondition.signalIfReady() to signalAllIfReady(), it works, but I would really like to find out the actual root cause of this!
I have a 32-core CPU, and we are collecting and preparing our index documents in parallel. When I limit the pool to 8 threads, it also works just fine.
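To make the suspected locking pattern easier to picture, here is a minimal, hypothetical sketch (this is not the library's actual FnCondition code; all class, field, and method names below are illustrative assumptions) of why waking a single waiter can leave the other producers parked when a whole bulk of capacity is released at once:

    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative stand-in for the ingester's capacity gate; not the real implementation.
    class CapacityGate {
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition notFull = lock.newCondition();
        private final int maxOperations;
        private int pending = 0;

        CapacityGate(int maxOperations) {
            this.maxOperations = maxOperations;
        }

        // Called by many producer threads, analogous to BulkIngester.add(...).
        void acquireSlot() {
            lock.lock();
            try {
                while (pending >= maxOperations) {
                    notFull.awaitUninterruptibly(); // producers queue up here, as in the thread dump
                }
                pending++;
            } finally {
                lock.unlock();
            }
        }

        // Called when a bulk request completes and frees capacity for many operations at once.
        void releaseSlots(int completedOps) {
            lock.lock();
            try {
                pending -= completedOps;
                // notFull.signal() wakes only ONE waiting producer even though capacity for a
                // whole bulk was just freed; the remaining waiters stay parked until the next
                // signal arrives (e.g. the next flush-interval tick), matching the
                // "one document every 10 seconds" symptom described above.
                // notFull.signalAll() wakes every waiter and lets each re-check the predicate,
                // which is what switching to signalAllIfReady() effectively does.
                notFull.signalAll();
            } finally {
                lock.unlock();
            }
        }
    }

Whether the real FnCondition has exactly this shape is an assumption; the point is only that a single signal per bulk response cannot drain a crowd of waiting producer threads.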
Hi, I have the same problem with Java API client version 8.12.1 and Elasticsearch version 8.12.1.
One detail I noticed is that when printing the id of the bulk in beforeBulk, there were jumps where the id should have been sequential.
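For reference, a minimal listener that surfaces those execution ids in beforeBulk (a hypothetical sketch assuming the client's standard BulkListener interface; the listeners actually used in this thread also retry on errors) would look roughly like this:

    import java.util.List;
    import co.elastic.clients.elasticsearch._helpers.bulk.BulkListener;
    import co.elastic.clients.elasticsearch.core.BulkRequest;
    import co.elastic.clients.elasticsearch.core.BulkResponse;

    BulkListener<Void> bulkListener = new BulkListener<Void>() {
        @Override
        public void beforeBulk(long executionId, BulkRequest request, List<Void> contexts) {
            // Printing the execution id here is where the non-sequential "jumps" were observed.
            System.out.println("before bulk #" + executionId + " with " + request.operations().size() + " operations");
        }

        @Override
        public void afterBulk(long executionId, BulkRequest request, List<Void> contexts, BulkResponse response) {
            System.out.println("after bulk #" + executionId + " took " + response.took() + " ms");
        }

        @Override
        public void afterBulk(long executionId, BulkRequest request, List<Void> contexts, Throwable failure) {
            System.err.println("bulk #" + executionId + " failed: " + failure);
        }
    };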
@codehustler, Did you find any other solution?
Hello, I'd like to try to reproduce this. I have used the BulkIngester recently with an 80k-row document and nothing of the sort happened. Could you provide the code for the BulkIngester configuration? Thank you.
Hello,
We create the BulkIngester with the following code:
    BulkIngester.of(b -> b
        .client(esClient)
        .globalSettings(gsb -> gsb.refresh(Refresh.False))
        .maxOperations(120)
        .flushInterval(10, TimeUnit.SECONDS)
        .listener(bulkListener)
        .maxConcurrentRequests(8)
        .maxSize(new ByteSizeValue(1, MB).getBytes()));
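For context, the ingester built above is then shared by many worker threads. A hypothetical sketch of that submission side (not the actual application code; the pool size, document type, and index name are placeholders) could look like this:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // 'documents' and 'MyDocument' stand in for the real application types;
    // 'ingester' is the instance returned by the BulkIngester.of(...) call above.
    ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    for (MyDocument doc : documents) {
        pool.submit(() -> ingester.add(op -> op
            .index(idx -> idx
                .index("my-index")        // illustrative index name
                .id(doc.id())
                .document(doc))));
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES); // throws InterruptedException; handle as appropriate
    ingester.close(); // flushes any remaining operations

The hang described in this thread shows up in those concurrent add() calls when the worker pool is large.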
We have observed this when there is a high level of parallelism. With a c5.9xlarge instance (36 cores) it happened to us very often; however, we changed the instance to a c5.4xlarge (16 cores) and it has not happened to us again.
The BulkListener that we use just logs the errors and retries depending on the error.
Thank you @victorGS18, we'll investigate this.
I believe that we are running into this very same issue. We have multiple threads all sending updates to a single BulkIngester. They seem to hang in the same spot, at FnCondition.whenReadyIf.