nextflow
nextflow copied to clipboard
google-batch workflow interruptions with "UNAVAILABLE: 502:Bad Gateway"
Bug report
When running workflows with google-batch, I sometimes get an UNAVAILABLE: 502:Bad Gateway
error which causes a workflow to be interrupted regardless of my error strategy. This is relatively rare, seemingly random, and happens mostly for large runs (usually thousands of jobs). The issue might also not be linked to nextflow itself but the google-batch service.
Finally, it's probably not related, but for the specific run I shared below, I had many jobs that were preempted (~800 among over ~6k submitted jobs).
Expected behavior and actual behavior
Expected: Normal execution of the workflow or more feedback about what the actual issue is. Actual: Workflow interrupted
Steps to reproduce the problem
Unable to reproduce, but happens fairly consistently for large scale runs (custom RNA-seq pipeline)
Program output
See attached .nextflow.log
Relevant lines:
Nov-20 18:27:15.840 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=ROME_QUANT_HS:FASTQ_FILTER (PPMI-Phase1-IR1.51625.BL.0000430667.5104-SL-0473.L002); work-dir=gs://nxf-work/cedric/ROMEQuant/4b/98813e83210abb7fcba3c9c5261e40
error [com.google.api.gax.rpc.UnavailableException]: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
Nov-20 18:27:15.857 [Task submitter] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ROME_QUANT_HS:FASTQ_FILTER (PPMI-Phase1-IR1.51625.BL.0000430667.5104-SL-0473.L002)'
Caused by:
UNAVAILABLE: 502:Bad Gateway
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:67)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:574)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:541)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
--
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1042)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1128)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
at groovy.lang.Closure.call(Closure.java:427)
at groovy.lang.Closure.call(Closure.java:406)
at groovy.lang.Closure.run(Closure.java:498)
... 1 common frames omitted
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
at io.grpc.Status.asRuntimeException(Status.java:539)
... 14 common frames omitted
Nov-20 18:27:15.866 [Task submitter] DEBUG nextflow.Session - Session aborted -- Cause: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
Environment
- Nextflow version:
23.09.3-edge build 5888
- Java version:
openjdk version "11.0.20.1" 2023-08-24
- Operating system:
Linux nf-tower-main 5.15.0-1027-gcp #34~20.04.1-Ubuntu SMP Mon Jan 9 18:40:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Bash version:
zsh 5.8 (x86_64-ubuntu-linux-gnu)
@soj-hub we are seeing an increasing rate of 502 error from Google Batch. Is this expected? is there any advice on how to handle this condition?
Can you share the region of the jobs and some Job UIDs of successful and failed jobs? We can investigate.
Sure!
- The region is us-central1.
- Some jobs that succeeded are: nf-7aab7749-1700484953708, nf-aa9aa38d-1700484953749, nf-de4ba8c2-1700484953788
- Some jobs that failed due to preemption (exit code 50001): nf-3bc01a19-1700484952944, nf-dbbeed31-1700484912379, nf-0db823af-1700484911676
- I can't have a run ID for the run that failed with 502, maybe it wasn't assigned one yet?
+1, this terminally interrupts workflows, and anecdotally seems to happen more on large workflows than small ones. I'm executing pipelines with -resume
to pick up and (hopefully) finish jobs. I updated today to 23.11.0-edge
to resolve the Google Batch spot-instance pre-emption errors -- thanks for all your work on the Batch executor so far! (Edit: switching back to the google-lifesciences executor is another way to work-around the Batch executor issues)
Steps to reproduce
I'm not able to reproduce this deterministically.
Stack trace
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1127)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:568)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:538)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:570)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:567)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:71)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:735)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:716)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
at com.google.cloud.batch.v1.BatchServiceClient.createJob(BatchServiceClient.java:316)
at com.google.cloud.batch.v1.BatchServiceClient.createJob(BatchServiceClient.java:248)
at nextflow.cloud.google.batch.client.BatchClient.submitJob(BatchClient.groovy:85)
at nextflow.cloud.google.batch.GoogleBatchTaskHandler.submit(GoogleBatchTaskHandler.groovy:136)
at nextflow.processor.TaskPollingMonitor.submit(TaskPollingMonitor.groovy:196)
at nextflow.processor.TaskPollingMonitor.submitPendingTasks(TaskPollingMonitor.groovy:565)
at nextflow.processor.TaskPollingMonitor.submitLoop(TaskPollingMonitor.groovy:390)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1042)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1128)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
at groovy.lang.Closure.call(Closure.java:427)
at groovy.lang.Closure.call(Closure.java:406)
at groovy.lang.Closure.run(Closure.java:498)
... 1 common frames omitted
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
at io.grpc.Status.asRuntimeException(Status.java:537)
... 14 common frames omitted
Dec-05 21:11:39.901 [Task submitter] DEBUG nextflow.Session - Session aborted -- Cause: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
Environment
NextFlow version 23.11.0-edge build 5891
openjdk version "11.0.19" 2023-04-18
OS: Ubuntu 18.04.6 LTS
shell: GNU bash, version 4.4.20(1)-release (x86_64-pc-linux-gnu)
NF config:
google {
project = 'xxx'
region = 'us-central1,us-east1,us-east4,us-west1,us-west2'
batch.preemptible = true
batch.bootDiskSize = '20 GB'
batch.debug = true
}
executor {
pollInterval = '5 sec'
dumpInterval = '10 sec'
submitRateLimit = '10 sec'
}
Some jobs for the same sample that succeeded: nf-6ee35e49-1701807374390, nf-6b52752f-1701807374422, nf-6a013421-1701806533653. As above, the log does not contain a nf-
ID for the failing job. It would be great if this were caught and handled as an individual task-level failure, so it could be governed by retry policy.
We missed this issue. I'd like to ask if this is still ongoing? If so, can you share some recent job UIDs and the region so we can investigate?
I have not done any large runs recently, and it didn't happen. I guess we could close this issue for now, and I can reopen it if this happens again (and share UIDs at the same time)
I reverted my executor to google-lifesciences for the time being, but will try google batch again once I've updated to NF v24 -- will keep you posted 🤞
@bentsherman I think we should consider a try policy in this scenario. If I'm not wrong the Google SDK has a built-it capability for that
@soj-hub, this error occurred again today (job UID: nf-3c8dd1e6-171027-ba6f08ab-72f4-49390):
Mar-13 05:13:50.241 [Task submitter] INFO nextflow.Session - [7c/8b3ddc] Submitted process > ROME_QUANT_HS:L1EM_QUANT (SRR11582419)
Mar-13 05:13:50.374 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=ROME_QUANT_HS:ALUEM_QUANT (SRR6761453); work-dir=gs://nxf-work/romequant/cedric/3c/8dd1e638c191fc08f1ce800c1324ea
error [com.google.api.gax.rpc.UnavailableException]: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
Mar-13 05:13:50.396 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ROME_QUANT_HS:ALUEM_QUANT (SRR6761453)'
Caused by:
UNAVAILABLE: 502:Bad Gateway
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1127)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:568)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:538)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:570)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:567)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:71)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:735)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:716)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
at com.google.cloud.batch.v1.BatchServiceClient.getJob(BatchServiceClient.java:427)
at com.google.cloud.batch.v1.BatchServiceClient.getJob(BatchServiceClient.java:373)
at nextflow.cloud.google.batch.client.BatchClient.describeJob(BatchClient.groovy:91)
at nextflow.cloud.google.batch.client.BatchClient.getJobStatus(BatchClient.groovy:107)
at nextflow.cloud.google.batch.GoogleBatchTaskHandler.getJobState(GoogleBatchTaskHandler.groovy:386)
at nextflow.cloud.google.batch.GoogleBatchTaskHandler.checkIfCompleted(GoogleBatchTaskHandler.groovy:421)
at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:615)
at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:537)
at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:412)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:293)
at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
at groovy.lang.Closure.run(Closure.java:505)
... 1 common frames omitted
@marcodelapierre think we need to schedule this as an improvement
Not sure if this related or not, but when resuming the workflow above, it got interrupted again with a different error:
com.google.api.client.http.HttpResponseException: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/7f/52cac7d420e6b21872b35382d39bf2/.command.sh&uploadType=resumable
Service Unavailable
I've added a retry policy to handle this error condition. Likely solved c4981dcc.
Hi,
It happened again on a large workflow run (~50k samples) with nextflow version 24.03.0-edge build 5908
. Here's the relevant log section:
May-12 14:43:58.393 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 2397; name: ROME_QUANT_HS:FASTERQ_DUMP (SRR21916663); status: COMPLETED; exit: 0; error: -; workDir: gs://nxf-work/romequant/cedric/08/19>
May-12 14:43:58.462 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=ROME_QUANT_HS:FASTERQ_DUMP (SRR3709490); work-dir=gs://nxf-work/romequant/cedric/83/dff585c2550590ed6cb788eb6e1963
error [com.google.api.client.http.HttpResponseException]: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable
May-12 14:43:58.470 [Task submitter] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ROME_QUANT_HS:FASTERQ_DUMP (SRR3709490)'
Caused by:
503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable
com.google.api.client.http.HttpResponseException: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable
at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.open(HttpStorageRpc.java:1000)
at com.google.cloud.storage.ResumableMedia.lambda$null$0(ResumableMedia.java:37)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.storage.Retrying.run(Retrying.java:54)
at com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$1(ResumableMedia.java:34)
at com.google.cloud.storage.BlobWriteChannel$Builder.build(BlobWriteChannel.java:315)
at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:609)
at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:579)
at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:101)
at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newWriteChannel(CloudStorageFileSystemProvider.java:467)
at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newByteChannel(CloudStorageFileSystemProvider.java:299)
at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:484)
at java.base/java.nio.file.Files.newOutputStream(Files.java:228)
at java.base/java.nio.file.Files.newBufferedWriter(Files.java:3008)
at java.base/java.nio.file.Files.newBufferedWriter(Files.java:3056)
at nextflow.executor.BashWrapperBuilder.write0(BashWrapperBuilder.groovy:445)
at nextflow.executor.BashWrapperBuilder.build(BashWrapperBuilder.groovy:425)
at nextflow.cloud.google.batch.GoogleBatchTaskHandler.submit(GoogleBatchTaskHandler.groovy:151)
at nextflow.processor.TaskPollingMonitor.submit(TaskPollingMonitor.groovy:196)
at nextflow.processor.TaskPollingMonitor.submitPendingTasks(TaskPollingMonitor.groovy:565)
at nextflow.processor.TaskPollingMonitor.submitLoop(TaskPollingMonitor.groovy:390)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1017)
at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1207)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
at groovy.lang.Closure.call(Closure.java:433)
at groovy.lang.Closure.call(Closure.java:412)
at groovy.lang.Closure.run(Closure.java:505)
at java.base/java.lang.Thread.run(Thread.java:840)
May-12 14:43:58.479 [Task submitter] DEBUG nextflow.Session - Session aborted -- Cause: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable
May-12 14:43:58.507 [Task submitter] DEBUG nextflow.Session - The following nodes are still active:
Of note, we also observed the 503 error as well on run of 80k samples. I can provide additional information if needed.
If you could provide the .nexflow.log
file it would help
Here's the full log file for me: nextflow.log
We added retry logic to the google batch client when managing tasks, but now the error still happens when writing the wrapper script:
com.google.api.client.http.HttpResponseException: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable
at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118)
at com.google.cloud.storage.spi.v1.HttpStorageRpc.open(HttpStorageRpc.java:1000)
at com.google.cloud.storage.ResumableMedia.lambda$null$0(ResumableMedia.java:37)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.storage.Retrying.run(Retrying.java:54)
at com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$1(ResumableMedia.java:34)
at com.google.cloud.storage.BlobWriteChannel$Builder.build(BlobWriteChannel.java:315)
at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:609)
at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:579)
at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:101)
at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newWriteChannel(CloudStorageFileSystemProvider.java:467)
at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newByteChannel(CloudStorageFileSystemProvider.java:299)
at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:484)
at java.base/java.nio.file.Files.newOutputStream(Files.java:228)
at java.base/java.nio.file.Files.newBufferedWriter(Files.java:3008)
at java.base/java.nio.file.Files.newBufferedWriter(Files.java:3056)
at nextflow.executor.BashWrapperBuilder.write0(BashWrapperBuilder.groovy:445)
at nextflow.executor.BashWrapperBuilder.build(BashWrapperBuilder.groovy:425)
at nextflow.cloud.google.batch.GoogleBatchTaskHandler.submit(GoogleBatchTaskHandler.groovy:151)
So we also need to handle this error here:
https://github.com/nextflow-io/nextflow/blob/0863a65a3175721b7d929d7e991b4b227706fdef/modules/nextflow/src/main/groovy/nextflow/executor/BashWrapperBuilder.groovy#L446-L462