Copying many files from local storage to S3 causes hang
Bug report
If I attempt to copy a directory tree with many (>10,000) files from local to S3, Nextflow hangs.
Expected behavior and actual behavior
Expected: The tree syncs to S3. Actual: No objects copy to S3.
Steps to reproduce the problem
mkdir foo
for i in `seq 101`; do mkdir foo/$i; for j in `seq 100`; do echo foo >foo/$i/$j; done; done
nextflow -log nextflow.log fs cp foo s3://some-bucket/foo
If the 101 in the loop is changed to 100, the copy succeeds.
Program output
There is no output to stdout. Attached: nextflow.log, jstack.txt
Environment
- Nextflow version: 23.10.1.5891
- Java version: openjdk 17.0.6 2023-01-17 LTS (also openjdk 11)
- Operating system: Linux (CentOS)
- Bash version: GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)
Not sure if it's interesting, a coincidence, or an interesting coincidence, but in @kdykdy's test case, when i and j give 100x100=10000 files it works, but at 101x100=10100 it fails, and the last line of the debug.log,
Jan-27 00:34:10.343 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'S3TransferManager' minSize=10; maxSize=10; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
mentions creating a LinkedBlockingQueue of size 10000. Seems like when the work doesn't exceed the queue things work, but when there's more work than that, things hang.
Good catch. It looks like you can remove this limit with the following config option:
threadPool.S3TransferManager.maxQueueSize = -1
To be honest I'm not sure why there is a limit to begin with, since exceeding the limit causes a failure rather than just taking longer. However without the limit you might still encounter a crash if Nextflow allocates too many threads at once and runs out of memory.
You might also try using virtual threads. With Java 21 and Nextflow 23.10.0 virtual threads are enabled by default.
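For reference, a minimal sketch of how the suggested workaround could be applied, assuming the option is placed in nextflow.config (the thread itself does not say which file); note that a later comment reports the value must be greater than 0, so this exact value may be rejected:
// nextflow.config (sketch)
// Lift the 10,000-entry limit on the S3TransferManager work queue
threadPool.S3TransferManager.maxQueueSize = -1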
Thanks @bentsherman -- I'm confused though. Having a worker queue with a limited size, that causes additions to block when there are no free workers, seems perfectly reasonable. Are you suggesting removing the limit as a solution or as a workaround until the queue is fixed?
Just a workaround for now. I don't know if we will remove the limit internally, there might be other reasons for it that I don't know about. But for you at least it should be a quick fix.
Thanks.
Checking some assumptions: if we set maxQueueSize to -1, I'm assuming that
- the queue can grow without bound (or until running out of memory or ...); and
- there is still a maximum amount of parallelism, so we don't need to worry about making 10,000 simultaneous S3 API calls.
Is that correct-ish?
Correct. The maximum number of concurrent threads is still limited by threadPool.S3TransferManager.maxThreads, which defaults to max(10, n_cpus * 3).
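As a side note, a sketch of how that thread cap could be pinned explicitly in nextflow.config if desired; the value 10 here is purely illustrative, not a recommendation:
// nextflow.config (sketch)
// Upper bound on concurrent S3 transfer threads; the reported default is max(10, n_cpus * 3)
threadPool.S3TransferManager.maxThreads = 10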
Thank you @bentsherman! That config option must be greater than 0, and setting it to a sufficiently large value (such as 2^31 - 1) does indeed work around the problem.
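Putting it together, a sketch of the configuration reported to resolve the hang here, using the value stated above (whether smaller positive values also suffice is not covered in this thread):
// nextflow.config (sketch)
// maxQueueSize must be > 0, so use a very large value to make the queue effectively unbounded
threadPool.S3TransferManager.maxQueueSize = 2147483647   // 2^31 - 1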