java-client-api Additional threads created even if thread count is set to one.

Version of MarkLogic Java Client API

v5.0.1

Version of MarkLogic Server

v10.0-2.1

Java version

openjdk version "11.0.3" 2019-04-16

OS and version

ProductName: Mac OS X
ProductVersion: 10.15.3
BuildVersion: 19D76

Input:

QueryBatcher queryBatcher = moveMgr.newQueryBatcher(rcQueryDef)
        .withBatchSize(100)
        .withThreadCount(1)
        .onUrisReady(batch -> {
          System.out.println("=======>" + batch.getItems().length);
        })
        .onQueryFailure(Throwable::printStackTrace);
queryBatcher.setMaxBatches(1);
moveMgr.startJob(queryBatcher);
queryBatcher.awaitCompletion();

Actual output:

2020-02-28 10:57:38.516 [task-1] INFO  c.m.c.d.impl.QueryBatcherImpl - Starting job batchSize=100, threadCount=1, onUrisReady listeners=2, failure listeners=4
=======>100
=======>100
=======>100
2020-02-28 10:57:38.555 [task-1] INFO  c.m.c.d.impl.QueryBatcherImpl - Job complete, jobBatchNumber=3, jobResultsSoFar=300

Expected output:

The setup has batch size = 100, thread count = 1, max batches = 1, and number of forests = 3.

Based on the thread count of 1, the QueryBatcher should pull only 100 docs. Currently it pulls 300 items, i.e. 100 items per forest, even though the thread count is set to 1.

Feb 29 '20 16:02 akshaysonvane

As per documentation:

http://pubs.marklogic.com:8011/guide/java/data-movement#id_66180

For WriteBatcher:

The thread count configuration parameter of a WriteBatcher is the number of threads in the client JVM that will be dedicated to writing batches to MarkLogic. The threads operate in parallel, each servicing one batch at a time.

Ideally, you should choose a thread count that will keep most of the job threads busy and keep MarkLogic busy without overwhelming your cluster. You should usually configure at least as many client threads as hosts containing forests in the target database. The default is one thread per forest host.

Mar 02 '20 19:03 georgeajit

http://pubs.marklogic.com:8011/guide/java/data-movement#id_67998

For QueryBatcher:

threads operate in parallel, each servicing one batch at a time.

Ideally, you should choose a thread count that will keep most of the job threads busy. If your listener interacts with MarkLogic, you should ideally also keep MarkLogic busy without overwhelming the cluster. For a job that interacts with MarkLogic, you should usually have more client threads than hosts containing forests in the target database.

Mar 02 '20 19:03 georgeajit

I don't think this was ever a bug - the thread isn't pinned to a host, and it's expected to retrieve all matching URIs, regardless of the number of threads and number of forests.

Nov 17 '22 15:11 rjrudin

Hey @rjrudin, I think we need to look at it from the batch perspective rather than the internal thread perspective. Based on the current configuration of batch size 100 and max batches 1, I would expect only the first 100 records to be fetched. IIRC, I wanted to limit the number of records sent to the user based on the user-defined limit in HC. If we had to use QueryBatcher for this, the only option would be to fetch all records and then drop the extra records in the middle-tier code, which is inefficient.

Nov 17 '22 15:11 akshaysonvane

Thanks @akshaysonvane, I missed the call to setMaxBatches. That appears to work fine when the QueryBatcher receives an Iterator of URIs, but it doesn't appear to work otherwise. Whether additional threads are created doesn't really matter here; the problem is that setMaxBatches doesn't have any impact, or at best the docs for the method are incorrect. Will track internally.

Nov 17 '22 16:11 rjrudin
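
Since setMaxBatches is reported above to work only when the QueryBatcher receives an Iterator of URIs, one possible interim approach is to cap the iterator itself before handing it over. The following is a minimal sketch under that assumption, not a confirmed fix: UriLimiter and its limit method are hypothetical names that are not part of the Java Client API, and the list of 300 URIs merely stands in for whatever source of URIs the application already has.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper (not part of the MarkLogic Java Client API):
// wraps an iterator so that it yields at most maxItems elements.
final class UriLimiter {

    private UriLimiter() {}

    static <T> Iterator<T> limit(Iterator<T> source, long maxItems) {
        if (maxItems < 0) {
            throw new IllegalArgumentException("maxItems must be >= 0");
        }
        return new Iterator<T>() {
            private long returned = 0;

            @Override
            public boolean hasNext() {
                // Stop as soon as the cap is reached, even if the source has more.
                return returned < maxItems && source.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                returned++;
                return source.next();
            }
        };
    }

    public static void main(String[] args) {
        // Stand-in data: 300 URIs, mirroring the 3 forests x 100 items in the report.
        List<String> allUris = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            allUris.add("/doc/" + i + ".json");
        }

        int batchSize = 100;
        int maxBatches = 1;
        Iterator<String> capped = limit(allUris.iterator(), (long) batchSize * maxBatches);

        int count = 0;
        while (capped.hasNext()) {
            capped.next();
            count++;
        }
        System.out.println(count); // prints 100
    }
}
```

The capped iterator could then be passed to moveMgr.newQueryBatcher(capped) in place of the query definition. This only helps when the URIs are already available as an Iterator; it does not change how a query-based QueryBatcher behaves.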
