Copying many files from local storage to S3 causes hang
Bug report
If I attempt to copy a directory tree with many (>10,000) files from local to S3, Nextflow hangs.
Expected behavior and actual behavior
Expected: The tree syncs to S3. Actual: No objects copy to S3.
Steps to reproduce the problem
mkdir foo
for i in `seq 101`; do mkdir foo/$i; for j in `seq 100`; do echo foo >foo/$i/$j; done; done
nextflow -log nextflow.log fs cp foo s3://some-bucket/foo
If the 101 in the loop is changed to 100, the copy succeeds.
Program output
There is no output to stdout. Attached: nextflow.log, jstack.txt
Environment
- Nextflow version: 23.10.1.5891
- Java version: openjdk 17.0.6 2023-01-17 LTS (also openjdk 11)
- Operating system: Linux (CentOS)
- Bash version: GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)
Not sure if it's interesting, a coincidence, or an interesting coincidence, but in @kdykdy's test case, when i and j give 100x100=10000 files it works, but at 101x100=10100 it fails, and the last line of the debug.log,
Jan-27 00:34:10.343 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'S3TransferManager' minSize=10; maxSize=10; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
mentions creating a LinkedBlockingQueue of size 10000. Seems like when the work doesn't exceed the queue things work, but when there's more work than that, things hang.
Good catch. It looks like you can remove this limit with the following config option:
threadPool.S3TransferManager.maxQueueSize = -1
To be honest I'm not sure why there is a limit to begin with, since exceeding the limit causes a failure rather than just taking longer. However without the limit you might still encounter a crash if Nextflow allocates too many threads at once and runs out of memory.
You might also try using virtual threads. With Java 21 and Nextflow 23.10.0 virtual threads are enabled by default.
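For reference, a minimal sketch of how the suggested workaround could be applied, assuming the option is placed in nextflow.config (the thread itself does not say which file); note that a later comment reports the value must be greater than 0, so this exact value may be rejected:
// nextflow.config (sketch)
// Lift the 10,000-entry limit on the S3TransferManager work queue
threadPool.S3TransferManager.maxQueueSize = -1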
Thanks @bentsherman -- I'm confused though. Having a worker queue with a limited size, that causes additions to block when there are no free workers, seems perfectly reasonable. Are you suggesting removing the limit as a solution or as a workaround until the queue is fixed?
Just a workaround for now. I don't know if we will remove the limit internally, there might be other reasons for it that I don't know about. But for you at least it should be a quick fix.
Thanks.
Checking some assumptions: if we set maxQueueSize to -1, I'm assuming that
- the queue can grow without bound (or until running out of memory or ...); and
- there is still a maximum amount of parallelism, so we don't need to worry about making 10,000 simultaneous S3 API calls.
Is that correct-ish?
Correct. The maximum number of concurrent threads is still limited by threadPool.S3TransferManager.maxThreads, which defaults to max(10, n_cpus * 3).
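As a side note, a sketch of how that thread cap could be pinned explicitly in nextflow.config if desired; the value 10 here is purely illustrative, not a recommendation:
// nextflow.config (sketch)
// Upper bound on concurrent S3 transfer threads; the reported default is max(10, n_cpus * 3)
threadPool.S3TransferManager.maxThreads = 10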
Thank you @bentsherman! That config option must be greater than 0, and setting it to a sufficiently large value (such as 2^31 - 1) does indeed work around the problem.
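Putting it together, a sketch of the configuration reported to resolve the hang here, using the value stated above (whether smaller positive values also suffice is not covered in this thread):
// nextflow.config (sketch)
// maxQueueSize must be > 0, so use a very large value to make the queue effectively unbounded
threadPool.S3TransferManager.maxQueueSize = 2147483647   // 2^31 - 1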