
nextflow 24.0.4.4 never exits due to incomplete file transfer

Open divinomas-gh opened this issue 1 year ago • 6 comments

Bug report

Expected behavior and actual behavior

I have been using nextflow 22.10.6.5843, which runs smoothly. After updating to v24.0.4.4, the same script hangs with some file transfers never finishing. The files to be transferred total around 50 GB.

Steps to reproduce the problem

Program output

Oct-03 12:42:12.157 [main] DEBUG nextflow.Session - Session await > all processes finished
Oct-03 12:42:17.082 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: slurm) - terminating tasks monitor poll loop
Oct-03 12:42:17.082 [main] DEBUG nextflow.Session - Session await > all barriers passed
Oct-03 12:42:17.093 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'TaskFinalizer' shutdown completed (hard=false)
Oct-03 12:42:22.095 [main] INFO nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (7 files)
Oct-03 12:43:22.102 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (7 files)
Oct-03 12:44:22.104 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (6 files)
Oct-03 12:45:22.106 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (6 files)
Oct-03 12:46:22.108 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (6 files)
Oct-03 12:47:22.110 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (6 files)
Oct-03 12:48:22.112 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (4 files)
Oct-03 12:49:22.114 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (3 files)
Oct-03 12:50:22.116 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (3 files)
.......
.......
.......
Oct-04 00:41:23.430 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (3 files)
Oct-04 00:42:18.432 [main] WARN nextflow.util.ThreadPoolHelper - Exiting before file transfers were completed -- Some files may be lost
Oct-04 00:42:18.432 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'PublishDir' shutdown completed (hard=false)
Oct-04 00:42:18.463 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=32; failedCount=0; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=226d 11h 31m 9s; failedDuration=0ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=5; peakCpus=125; peakMemory=0; ]
Oct-04 00:42:18.733 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Oct-04 00:42:18.820 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'FileTransfer' shutdown completed (hard=false)
Oct-04 00:42:18.820 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

Environment

  • Nextflow version: 24.0.4.4
  • Java version: openjdk version "20.0.2-internal" 2023-07-18
  • Operating system: linux
  • Bash version: GNU bash, version 5.1.16(1)-release

Additional context

divinomas-gh avatar Oct 04 '24 00:10 divinomas-gh

Did it hang or did it just exit without finishing all of the file transfers? Your issue title suggests the former but your log suggests the latter

bentsherman avatar Oct 04 '24 12:10 bentsherman

Did it hang or did it just exit without finishing all of the file transfers? Your issue title suggests the former but your log suggests the latter

It hangs for ~12 hours, then shows the "Exiting before file transfers were completed -- Some files may be lost" message, and then hangs without exiting for days.

divinomas-gh avatar Oct 12 '24 10:10 divinomas-gh

Then it looks like one of the file uploads hung. Nextflow will time out after 12 hours, so that part is expected behavior. As for the file upload, it's hard to know the root cause. I would see if it happens consistently first. If not, it might be some intermittent networking issue.

bentsherman avatar Oct 13 '24 13:10 bentsherman

We are encountering the same issue using both 24.04 and the latest edge release. Is there an option to "retry" the file transfer after it hangs for a given amount of time?

matthdsm avatar Oct 15 '24 06:10 matthdsm

@matthdsm does it happen consistently? and are you saying it doesn't happen for other versions?

bentsherman avatar Oct 16 '24 15:10 bentsherman

It happens often, but not consistently. We started noticing the phenomenon after updating to 24.04, but I'm not 100% sure it didn't happen before.

matthdsm avatar Oct 16 '24 17:10 matthdsm

I'm experiencing the same issue. It does not happen for every workflow but it seems to happen consistently for one of the workflows we run.

@matthdsm What did you do in the end to resolve the issue?

tverbeiren avatar Jan 22 '25 07:01 tverbeiren

Are you able to include a jstack dump of the hanging (Linux) process?

pditommaso avatar Jan 22 '25 07:01 pditommaso
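(Side note for anyone else who needs to capture this: jstack ships with the JDK, so on the node running the Nextflow head job you can find its PID with jps -l and capture the dump with jstack <PID> > jstack-publishdir.txt; the output filename here is just illustrative.)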

See attached, does that suffice @pditommaso ?

jstack-failed-publishdir.txt

tverbeiren avatar Jan 22 '25 07:01 tverbeiren

There are two threads in WAITING state while publishing data. Still don't know the reason.

"PublishDir-2" #22947 prio=5 os_prio=0 cpu=12174.17ms elapsed=48025.04s tid=0x00007fecd80e7010 nid=0x5b05 waiting on condition  [0x00007fed75ef4000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x0000000579d180e0> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1047)
	at java.util.concurrent.CountDownLatch.await([email protected]/CountDownLatch.java:230)
	at com.amazonaws.services.s3.transfer.MultipleFileTransferStateChangeListener.transferStateChanged(MultipleFileTransferStateChangeListener.java:40)
	at com.amazonaws.services.s3.transfer.internal.AbstractTransfer.setState(AbstractTransfer.java:165)
	at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:144)
	at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:115)
	at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:45)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution([email protected]/ThreadPoolExecutor.java:2037)
	at java.util.concurrent.ThreadPoolExecutor.reject([email protected]/ThreadPoolExecutor.java:833)
	at java.util.concurrent.ThreadPoolExecutor.execute([email protected]/ThreadPoolExecutor.java:1365)
	at java.util.concurrent.AbstractExecutorService.submit([email protected]/AbstractExecutorService.java:145)
	at com.amazonaws.services.s3.transfer.internal.UploadMonitor.create(UploadMonitor.java:95)
	at com.amazonaws.services.s3.transfer.TransferManager.doUpload(TransferManager.java:701)
	at com.amazonaws.services.s3.transfer.TransferManager.uploadFileList(TransferManager.java:1935)
	at com.amazonaws.services.s3.transfer.TransferManager.uploadDirectory(TransferManager.java:1693)
	at nextflow.cloud.aws.nio.S3Client.uploadDirectory(S3Client.java:635)
	at nextflow.cloud.aws.nio.S3FileSystemProvider.upload(S3FileSystemProvider.java:335)
	at nextflow.file.FileHelper.copyPath(FileHelper.groovy:998)
	at nextflow.processor.PublishDir.processFileImpl(PublishDir.groovy:507)
	at nextflow.processor.PublishDir.processFile(PublishDir.groovy:406)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0([email protected]/Native Method)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke([email protected]/NativeMethodAccessorImpl.java:77)
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke([email protected]/Method.java:569)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.PublishDir$_retryableProcessFile_closure2.doCall(PublishDir.groovy:397)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0([email protected]/Native Method)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke([email protected]/NativeMethodAccessorImpl.java:77)
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke([email protected]/Method.java:569)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:279)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at groovy.lang.Closure.call(Closure.java:433)
	at org.codehaus.groovy.runtime.ConvertedClosure.invokeCustom(ConvertedClosure.java:52)
	at org.codehaus.groovy.runtime.ConversionHandler.invoke(ConversionHandler.java:113)
	at jdk.proxy2.$Proxy67.get(jdk.proxy2/Unknown Source)
	at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
	at dev.failsafe.Functions$$Lambda$643/0x00007fed7d176bc0.apply(Unknown Source)
	at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:75)
	at dev.failsafe.internal.RetryPolicyExecutor$$Lambda$647/0x00007fed7d1778e8.apply(Unknown Source)
	at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:176)
	at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:437)
	at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:129)
	at nextflow.processor.PublishDir.retryableProcessFile(PublishDir.groovy:396)
	at nextflow.processor.PublishDir.safeProcessFile(PublishDir.groovy:367)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0([email protected]/Native Method)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke([email protected]/NativeMethodAccessorImpl.java:77)
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke([email protected]/Method.java:569)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.PublishDir$_apply1_closure1.doCall(PublishDir.groovy:342)
	at nextflow.processor.PublishDir$_apply1_closure1.call(PublishDir.groovy)
	at groovy.lang.Closure.run(Closure.java:505)
	at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:539)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run([email protected]/Thread.java:840)

"PublishDir-4" #23750 prio=5 os_prio=0 cpu=688.14ms elapsed=47228.99s tid=0x00007fed08014bc0 nid=0x5e2a waiting on condition  [0x00007fec8acfb000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x000000057c9a96d8> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1047)
	at java.util.concurrent.CountDownLatch.await([email protected]/CountDownLatch.java:230)
	at com.amazonaws.services.s3.transfer.MultipleFileTransferStateChangeListener.transferStateChanged(MultipleFileTransferStateChangeListener.java:40)
	at com.amazonaws.services.s3.transfer.internal.AbstractTransfer.setState(AbstractTransfer.java:165)
	at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:144)
	at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:115)
	at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:45)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution([email protected]/ThreadPoolExecutor.java:2037)
	at java.util.concurrent.ThreadPoolExecutor.reject([email protected]/ThreadPoolExecutor.java:833)
	at java.util.concurrent.ThreadPoolExecutor.execute([email protected]/ThreadPoolExecutor.java:1365)
	at java.util.concurrent.AbstractExecutorService.submit([email protected]/AbstractExecutorService.java:145)
	at com.amazonaws.services.s3.transfer.internal.UploadMonitor.create(UploadMonitor.java:95)
	at com.amazonaws.services.s3.transfer.TransferManager.doUpload(TransferManager.java:701)
	at com.amazonaws.services.s3.transfer.TransferManager.uploadFileList(TransferManager.java:1935)
	at com.amazonaws.services.s3.transfer.TransferManager.uploadDirectory(TransferManager.java:1693)
	at nextflow.cloud.aws.nio.S3Client.uploadDirectory(S3Client.java:635)
	at nextflow.cloud.aws.nio.S3FileSystemProvider.upload(S3FileSystemProvider.java:335)
	at nextflow.file.FileHelper.copyPath(FileHelper.groovy:998)
	at nextflow.processor.PublishDir.processFileImpl(PublishDir.groovy:507)
	at nextflow.processor.PublishDir.processFile(PublishDir.groovy:406)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0([email protected]/Native Method)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke([email protected]/NativeMethodAccessorImpl.java:77)
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke([email protected]/Method.java:569)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.PublishDir$_retryableProcessFile_closure2.doCall(PublishDir.groovy:397)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0([email protected]/Native Method)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke([email protected]/NativeMethodAccessorImpl.java:77)
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke([email protected]/Method.java:569)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:279)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at groovy.lang.Closure.call(Closure.java:433)
	at org.codehaus.groovy.runtime.ConvertedClosure.invokeCustom(ConvertedClosure.java:52)
	at org.codehaus.groovy.runtime.ConversionHandler.invoke(ConversionHandler.java:113)
	at jdk.proxy2.$Proxy67.get(jdk.proxy2/Unknown Source)
	at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
	at dev.failsafe.Functions$$Lambda$643/0x00007fed7d176bc0.apply(Unknown Source)
	at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:75)
	at dev.failsafe.internal.RetryPolicyExecutor$$Lambda$647/0x00007fed7d1778e8.apply(Unknown Source)
	at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:176)
	at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:437)
	at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:129)
	at nextflow.processor.PublishDir.retryableProcessFile(PublishDir.groovy:396)
	at nextflow.processor.PublishDir.safeProcessFile(PublishDir.groovy:367)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0([email protected]/Native Method)
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke([email protected]/NativeMethodAccessorImpl.java:77)
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke([email protected]/Method.java:569)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.PublishDir$_apply1_closure1.doCall(PublishDir.groovy:342)
	at nextflow.processor.PublishDir$_apply1_closure1.call(PublishDir.groovy)
	at groovy.lang.Closure.run(Closure.java:505)
	at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:539)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run([email protected]/Thread.java:840)

pditommaso avatar Jan 22 '25 09:01 pditommaso

Interestingly, both are executed via CallerRunsPolicy.rejectedExecution, which may break the CountDownLatch synchronization.

pditommaso avatar Jan 22 '25 09:01 pditommaso

This is what I think is happening: it is uploading a directory with a lot of files, and one of the sub-uploads is rejected (I guess because of the thread and queue limits). CallerRunsPolicy then runs com.amazonaws.services.s3.transfer.internal.UploadMonitor.call synchronously in the same thread, but the AWS client always expects it to be executed asynchronously, at least when uploading directories via uploadFileList. It uses a CountDownLatch to wait for all executions; that is the latch blocked on await, and it can never be counted down because the code never reaches that point. Increasing the queue size of the S3 thread pool would mitigate this, but maybe we could change the default RejectedExecutionHandler to something like AbortPolicy. That would produce an exception, and the transfer would be retried by the failsafe retry mechanism.

jorgee avatar Jan 22 '25 13:01 jorgee
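To make the suspected failure mode concrete, here is a minimal, hypothetical Java sketch (not the AWS SDK or Nextflow code) of how CallerRunsPolicy plus a CountDownLatch can self-deadlock: a rejected task runs inline in the submitting thread and blocks on a latch that only the submitting thread would later release.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsDeadlockSketch {
    public static void main(String[] args) {
        // Tiny pool and queue so a rejection happens quickly.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // Released by the submitter only AFTER all sub-tasks have been handed
        // to the pool -- analogous to the latch awaited by the multi-file
        // transfer listener in the stack traces above.
        CountDownLatch allSubmitted = new CountDownLatch(1);

        for (int i = 0; i < 10; i++) {
            pool.execute(() -> {
                try {
                    // Pool threads park here harmlessly, but once a task is
                    // rejected, CallerRunsPolicy runs it in the submitting
                    // thread, which then blocks here and never reaches the
                    // countDown() below: a self-deadlock.
                    allSubmitted.await();
                    // ... actual upload work would go here ...
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        allSubmitted.countDown(); // never reached once a rejected task runs inline
        pool.shutdown();
    }
}

Increasing the queue capacity only makes the rejection less likely; switching the handler to AbortPolicy, or blocking the submitter as discussed below, removes the inline execution that creates the cycle.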

I don't think it has anything to do with the S3 transfer, because we're seeing this issue on a shared FS too.

matthdsm avatar Jan 22 '25 13:01 matthdsm

@jorgee I was suspecting something like that. What about using maxThreads and queueSize? The idea would be to block the submitting thread instead of queuing too many upload tasks and causing the rejection policy to be invoked.

https://github.com/nextflow-io/nextflow/blob/0e30a8f0b50ba3e800631b6708f100cf2c7b9f33/modules/nextflow/src/main/groovy/nextflow/util/ThreadPoolManager.groovy#L98-L99

pditommaso avatar Jan 22 '25 13:01 pditommaso

Would it help to break up directory publishing by publishing individual files instead? See #3933

bentsherman avatar Jan 22 '25 13:01 bentsherman

Thanks for your feedback!

@bentsherman Your PR is still open. Do you happen to have a nightly build or something similar I can use to test it? Reconfiguring the workflow so it outputs individual files is not an option, I'm afraid: we just spent quite some time tuning the output structure to our needs.

@pditommaso @jorgee What could I try setting then?

  • aws.client.uploadMaxThreads
  • (in pre-run script) NXF_OPTS="-Dnxf.pool.maxThreads=???"
  • executor.queueSize

And what would be good values for these?

tverbeiren avatar Jan 22 '25 13:01 tverbeiren

(...) but maybe we could change the default RejectedExecutionHandler to something like AbortPolicy. That would produce an exception, and the transfer would be retried by the failsafe retry mechanism.

I would be very much in favour of this option. The situation right now is the worst possible one: users think the pipeline ran successfully but no data is written out. And when they resume the pipeline (in order to pick up cached tasks and just try to re-publish the files) it turns out no caching information is available because the head job did not finish properly.

tverbeiren avatar Jan 22 '25 14:01 tverbeiren

@tverbeiren can you please try the following settings?

threadPool.S3TransferManager.maxThreads = <num cpus * 3>
threadPool.S3TransferManager.maxQueueSize = <num cpus * 3>

pditommaso avatar Jan 22 '25 15:01 pditommaso

These are the current defaults; maxQueueSize must be bigger in your case.

DEFAULT_MIN_THREAD = 10
DEFAULT_MAX_THREAD = Math.max(DEFAULT_MIN_THREAD, Runtime.runtime.availableProcessors()*3)
DEFAULT_QUEUE_SIZE = 10_000

jorgee avatar Jan 22 '25 16:01 jorgee

I claim maxQueueSize should be the same as maxThreads

pditommaso avatar Jan 22 '25 16:01 pditommaso

If I understood the rejection mechanism correctly, it mainly happens when both the threads and the queue are full. So, if we reduce the queue, the rejection will happen earlier.

jorgee avatar Jan 22 '25 17:01 jorgee

I think you are right; my assumption was that a blocking queue was used, which would prevent more jobs from being added once it is full.

We may need to recover this implementation https://github.com/nextflow-io/nextflow/blob/c0e2aa7be9a2501674dddeeadf65bc91b0f9e782/modules/nextflow/src/main/groovy/nextflow/util/BlockingBlockingQueue.groovy#L33-L33

pditommaso avatar Jan 22 '25 18:01 pditommaso
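For illustration, the usual way to get that blocking behavior is a queue whose offer() delegates to put(): ThreadPoolExecutor only rejects a task when offer() returns false, so a blocking offer() makes the submitting thread wait instead of triggering the RejectedExecutionHandler. This is a generic Java sketch of the pattern, not the actual BlockingBlockingQueue code linked above.

import java.util.concurrent.LinkedBlockingQueue;

// Caveat: with this queue the executor never grows past its core pool size
// (it only adds threads when offer() fails), so core and max sizes should be
// set to the same value.
class BlockingOfferQueue<E> extends LinkedBlockingQueue<E> {

    BlockingOfferQueue(int capacity) {
        super(capacity);
    }

    @Override
    public boolean offer(E e) {
        try {
            put(e);       // block until there is room in the queue
            return true;  // never report rejection to the executor
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}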

Pushed a tentative solution: https://github.com/nextflow-io/nextflow/pull/5700

pditommaso avatar Jan 23 '25 09:01 pditommaso

Using the following configuration, all files are properly published:

threadPool.S3TransferManager.maxThreads = <cpus * 3>
threadPool.S3TransferManager.maxQueueSize = 100000

Do you see any disadvantages in setting this for all our workflows (with a proper solution pending)?

tverbeiren avatar Jan 23 '25 10:01 tverbeiren

Do you see any disadvantages in setting this for all our workflows (with a proper solution pending)?

It should be a valid workaround.

pditommaso avatar Jan 23 '25 11:01 pditommaso
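For anyone applying this workaround in the meantime, the settings above go in nextflow.config; a sketch assuming a 16-CPU head node (scale maxThreads to your own CPU count, as suggested earlier in the thread):

threadPool.S3TransferManager.maxThreads = 48       // 16 CPUs * 3
threadPool.S3TransferManager.maxQueueSize = 100000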

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 27 '25 04:06 stale[bot]

Closing, since a possible solution has been available since 25.04 and we haven't heard back. Feel free to re-open if you continue to see this issue in 25.04 or later.

bentsherman avatar Nov 14 '25 21:11 bentsherman

This issue persists on shared FS (using 25.04.2). There was also another user reporting a similar issue in the #sarek channel over at nf-core just yesterday

matthdsm avatar Nov 15 '25 06:11 matthdsm

We're actively migrating to 25.10 and the new workflow output definitions to see whether that resolves the issue.

matthdsm avatar Nov 15 '25 06:11 matthdsm

This issue persists on shared FS (using 25.04.2). There was also another user reporting a similar issue in the #sarek channel over at nf-core just yesterday

From the information in that channel, I cannot assess whether it is the same issue. It appears to be an inconsistent state after a failure, but no details are provided about the cause of the failure.

jorgee avatar Nov 17 '25 09:11 jorgee