
Copying many files from local storage to S3 causes hang

Open kdykdy opened this issue 1 year ago • 7 comments

Bug report

If I attempt to copy a directory tree with many (>10,000) files from local to S3, Nextflow hangs.

Expected behavior and actual behavior

Expected: The tree syncs to S3. Actual: No objects copy to S3.

Steps to reproduce the problem

mkdir foo
for i in `seq 101`; do mkdir foo/$i; for j in `seq 100`; do echo foo >foo/$i/$j; done; done
nextflow -log nextflow.log fs cp foo s3://some-bucket/foo

If the 101 in the outer loop is reduced to 100 (10,000 files total), the copy succeeds.

Program output

There is no output to stdout. Attached: nextflow.log, jstack.txt

Environment

  • Nextflow version: 23.10.1.5891
  • Java version: openjdk 17.0.6 2023-01-17 LTS (also openjdk 11)
  • Operating system: Linux (CentOS)
  • Bash version: GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)

kdykdy avatar Jan 27 '24 02:01 kdykdy

Not sure if it's interesting, a coincidence, or an interesting coincidence, but in @kdykdy's test case the copy works at 100x100=10000 files and fails at 101x100=10100, and the last line of the debug log,

Jan-27 00:34:10.343 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'S3TransferManager' minSize=10; maxSize=10; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false

mentions creating a LinkedBlockingQueue of size 10000. It seems that when the pending work fits in the queue things succeed, but when there is more work than the queue can hold, things hang.
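For anyone curious about the mechanism, here is a minimal standalone Java sketch (not Nextflow's actual code) of how a bounded LinkedBlockingQueue behaves once it reaches capacity; a plain put() in the same situation would block indefinitely if nothing drains the queue, which matches the observed hang:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // A queue bounded at 5, standing in for the 10000-slot work queue in the log
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>(5);

        for (int i = 0; i < 5; i++) {
            queue.offer(i);  // succeeds while there is spare capacity
        }

        // offer() with a timeout returns false once the queue is full;
        // put() would instead block until a consumer frees a slot
        boolean accepted = queue.offer(99, 50, TimeUnit.MILLISECONDS);
        System.out.println("accepted=" + accepted + " size=" + queue.size());
        // prints: accepted=false size=5
    }
}
```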

hartzell avatar Jan 29 '24 16:01 hartzell

Good catch. It looks like you can remove this limit with the following config option:

threadPool.S3TransferManager.maxQueueSize = -1

To be honest I'm not sure why there is a limit to begin with, since exceeding it causes a hang rather than just a slower copy. However, without the limit you might still encounter a crash if Nextflow allocates too many threads at once and runs out of memory.

You might also try using virtual threads. With Java 21 and Nextflow 23.10.0 virtual threads are enabled by default.

bentsherman avatar Jan 29 '24 18:01 bentsherman

Thanks @bentsherman -- I'm confused, though. Having a worker queue with a limited size that causes additions to block when there are no free workers seems perfectly reasonable. Are you suggesting removing the limit as a permanent solution, or as a workaround until the queue is fixed?

hartzell avatar Jan 29 '24 18:01 hartzell

Just a workaround for now. I don't know if we will remove the limit internally, there might be other reasons for it that I don't know about. But for you at least it should be a quick fix.

bentsherman avatar Jan 29 '24 19:01 bentsherman

Thanks. Checking some assumptions, if we set maxQueueSize to -1, I'm assuming that

  • the queue can grow without bound (or until running out of memory or ...); and
  • there is still a maximum amount of parallelism, so we don't need to worry about making 10,000 simultaneous S3 API calls.

Is that correct-ish?

hartzell avatar Jan 29 '24 21:01 hartzell

Correct. The maximum number of concurrent threads is still limited by threadPool.S3TransferManager.maxThreads, which defaults to max(10, n_cpus * 3).
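(For concreteness, a tiny Java sketch of that default formula, using the JVM's reported CPU count; this just restates max(10, n_cpus * 3), it is not Nextflow's source:)

```java
public class DefaultPoolSize {
    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // Default described above: max(10, n_cpus * 3)
        int maxThreads = Math.max(10, cpus * 3);
        // e.g. on a 4-core machine: max(10, 12) = 12
        System.out.println("cpus=" + cpus + " maxThreads=" + maxThreads);
    }
}
```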

bentsherman avatar Jan 29 '24 21:01 bentsherman

Thank you @bentsherman! It turns out that config option must be greater than 0, but setting it to a sufficiently large value (such as 2^31 - 1) does indeed work around the problem.
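For anyone landing here later, the working configuration might look like this in nextflow.config (a sketch based on the option name discussed above; 2147483647 is Integer.MAX_VALUE, i.e. 2^31 - 1):

```groovy
// nextflow.config -- workaround sketch: the option must be > 0,
// so use a very large queue size instead of -1
threadPool.S3TransferManager.maxQueueSize = 2147483647
```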

kdykdy avatar Jan 29 '24 23:01 kdykdy