java-client-api
Additional threads created even if thread count is set to one.
Version of MarkLogic Java Client API
v5.0.1
Version of MarkLogic Server
v10.0-2.1
Java version
openjdk version "11.0.3" 2019-04-16
OS and version
ProductName: Mac OS X ProductVersion: 10.15.3 BuildVersion: 19D76
Input:
QueryBatcher queryBatcher = moveMgr.newQueryBatcher(rcQueryDef)
    .withBatchSize(100)
    .withThreadCount(1)
    .onUrisReady(batch -> {
        System.out.println("=======>" + batch.getItems().length);
    })
    .onQueryFailure(Throwable::printStackTrace);
queryBatcher.setMaxBatches(1);
moveMgr.startJob(queryBatcher);
queryBatcher.awaitCompletion();
Actual output:
2020-02-28 10:57:38.516 [task-1] INFO c.m.c.d.impl.QueryBatcherImpl - Starting job batchSize=100, threadCount=1, onUrisReady listeners=2, failure listeners=4
=======>100
=======>100
=======>100
2020-02-28 10:57:38.555 [task-1] INFO c.m.c.d.impl.QueryBatcherImpl - Job complete, jobBatchNumber=3, jobResultsSoFar=300
Expected output:
The setup has batch size = 100, thread count = 1, max batches = 1, and number of forests = 3.
With a thread count of 1, the QueryBatcher should pull only 100 docs. Currently it pulls 300 items, i.e. 100 items per forest, even though the thread count is set to 1.
As per documentation:
http://pubs.marklogic.com:8011/guide/java/data-movement#id_66180
For WriteBatcher:
The thread count configuration parameter of a WriteBatcher is the number of threads in the client JVM that will be dedicated to writing batches to MarkLogic. The threads operate in parallel, each servicing one batch at a time.
Ideally, you should choose a thread count that will keep most of the job threads busy and keep MarkLogic busy without overwhelming your cluster. You should usually configure at least as many client threads as hosts containing forests in the target database. The default is one thread per forest host.
http://pubs.marklogic.com:8011/guide/java/data-movement#id_67998
In QueryBatcher:
threads operate in parallel, each servicing one batch at a time.
Ideally, you should choose a thread count that will keep most of the job threads busy. If your listener interacts with MarkLogic, you should ideally also keep MarkLogic busy without overwhelming the cluster. For a job that interacts with MarkLogic, you should usually have more client threads than hosts containing forests in the target database.
I don't think this was ever a bug - the thread isn't pinned to a host, and it's expected to retrieve all matching URIs, regardless of the number of threads and number of forests.
Hey @rjrudin, I think we need to look at it from the batch perspective rather than the internal thread perspective.
Based on the current configuration of batch size 100 and max batches 1, I would expect only the first 100 records to be fetched.
IIRC I wanted to limit the number of records being sent to the user based on the user-defined limit in HC. If we had to use query batcher for this then the only option would be to fetch all records and then drop the extra records in the middle tier code leading to inefficiencies.
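A minimal sketch of the middle-tier workaround described above, for illustration only. UriLimiter is a hypothetical helper, not part of the MarkLogic Java Client API. It assumes the listener hands over each batch as a String[] (as batch.getItems() does in the snippet in this report) and keeps only the first `limit` URIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the MarkLogic Java Client API.
// Keeps at most `limit` URIs across all batches and ignores the rest.
final class UriLimiter {
    private final int limit;
    private final List<String> kept = new ArrayList<>();

    UriLimiter(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    // Intended to be called from an onUrisReady listener, for example:
    //   .onUrisReady(batch -> limiter.accept(batch.getItems()))
    // Synchronized because listeners may be invoked from several threads.
    synchronized void accept(String[] uris) {
        for (String uri : uris) {
            if (kept.size() >= limit) {
                return; // limit reached: drop the remaining URIs
            }
            kept.add(uri);
        }
    }

    // True once the limit has been reached.
    synchronized boolean isFull() {
        return kept.size() >= limit;
    }

    // Returns an unmodifiable snapshot of the URIs kept so far.
    synchronized List<String> items() {
        return Collections.unmodifiableList(new ArrayList<>(kept));
    }
}
```

This only trims the result on the client side. All matching URIs are still fetched from the server, which is exactly the inefficiency mentioned above.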
Thanks @akshaysonvane, I missed the call to setMaxBatches. That appears to work fine when the QueryBatcher receives an Iterator of URIs, but it doesn't appear to work otherwise. The problem isn't that additional threads are created (though that doesn't really matter); the problem is that setMaxBatches doesn't have any impact, or the docs for the method are at best incorrect. Will track internally.