nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

google-batch workflow interruptions with "UNAVAILABLE: 502:Bad Gateway"

Open Puumanamana opened this issue 1 year ago • 17 comments

Bug report

When running workflows with google-batch, I sometimes get an UNAVAILABLE: 502:Bad Gateway error which causes a workflow to be interrupted regardless of my error strategy. This is relatively rare, seemingly random, and happens mostly for large runs (usually thousands of jobs). The issue might also not be linked to nextflow itself but the google-batch service. Finally, it's probably not related, but for the specific run I shared below, I had many jobs that were preempted (~800 among over ~6k submitted jobs).

Expected behavior and actual behavior

Expected: Normal execution of the workflow or more feedback about what the actual issue is. Actual: Workflow interrupted

Steps to reproduce the problem

Unable to reproduce, but happens fairly consistently for large scale runs (custom RNA-seq pipeline)

Program output

See attached .nextflow.log

Relevant lines:

Nov-20 18:27:15.840 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=ROME_QUANT_HS:FASTQ_FILTER (PPMI-Phase1-IR1.51625.BL.0000430667.5104-SL-0473.L002); work-dir=gs://nxf-work/cedric/ROMEQuant/4b/98813e83210abb7fcba3c9c5261e40
  error [com.google.api.gax.rpc.UnavailableException]: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
Nov-20 18:27:15.857 [Task submitter] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ROME_QUANT_HS:FASTQ_FILTER (PPMI-Phase1-IR1.51625.BL.0000430667.5104-SL-0473.L002)'

Caused by:
  UNAVAILABLE: 502:Bad Gateway

com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
        at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:67)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
        at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:574)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:541)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
--
                at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
                at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
                at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1042)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1128)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
                at groovy.lang.Closure.call(Closure.java:427)
                at groovy.lang.Closure.call(Closure.java:406)
                at groovy.lang.Closure.run(Closure.java:498)
                ... 1 common frames omitted
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
        at io.grpc.Status.asRuntimeException(Status.java:539)
        ... 14 common frames omitted
Nov-20 18:27:15.866 [Task submitter] DEBUG nextflow.Session - Session aborted -- Cause: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway

Environment

  • Nextflow version: 23.09.3-edge build 5888
  • Java version: openjdk version "11.0.20.1" 2023-08-24
  • Operating system: Linux nf-tower-main 5.15.0-1027-gcp #34~20.04.1-Ubuntu SMP Mon Jan 9 18:40:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Bash version: zsh 5.8 (x86_64-ubuntu-linux-gnu)

Puumanamana avatar Nov 22 '23 22:11 Puumanamana

@soj-hub we are seeing an increasing rate of 502 error from Google Batch. Is this expected? is there any advice on how to handle this condition?

pditommaso avatar Nov 23 '23 10:11 pditommaso

Can you share the region of the jobs and some Job UIDs of successful and failed jobs? We can investigate.

soj-hub avatar Nov 23 '23 15:11 soj-hub

Sure!

  • The region is us-central1.
  • Some jobs that succeeded are: nf-7aab7749-1700484953708, nf-aa9aa38d-1700484953749, nf-de4ba8c2-1700484953788
  • Some jobs that failed due to preemption (exit code 50001): nf-3bc01a19-1700484952944, nf-dbbeed31-1700484912379, nf-0db823af-1700484911676
  • I can't have a run ID for the run that failed with 502, maybe it wasn't assigned one yet?

Puumanamana avatar Nov 24 '23 12:11 Puumanamana

+1, this terminally interrupts workflows, and anecdotally seems to happen more on large workflows than small ones. I'm executing pipelines with -resume to pick up and (hopefully) finish jobs. I updated today to 23.11.0-edge to resolve the Google Batch spot-instance pre-emption errors -- thanks for all your work on the Batch executor so far! (Edit: switching back to the google-lifesciences executor is another way to work-around the Batch executor issues)

Steps to reproduce

I'm not able to reproduce this deterministically.

Stack trace

com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
        at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1127)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:568)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:538)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:570)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:567)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:71)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:735)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:716)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
        Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
                at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
                at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
                at com.google.cloud.batch.v1.BatchServiceClient.createJob(BatchServiceClient.java:316)
                at com.google.cloud.batch.v1.BatchServiceClient.createJob(BatchServiceClient.java:248)
                at nextflow.cloud.google.batch.client.BatchClient.submitJob(BatchClient.groovy:85)
                at nextflow.cloud.google.batch.GoogleBatchTaskHandler.submit(GoogleBatchTaskHandler.groovy:136)
                at nextflow.processor.TaskPollingMonitor.submit(TaskPollingMonitor.groovy:196)
                at nextflow.processor.TaskPollingMonitor.submitPendingTasks(TaskPollingMonitor.groovy:565)
                at nextflow.processor.TaskPollingMonitor.submitLoop(TaskPollingMonitor.groovy:390)
                at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.base/java.lang.reflect.Method.invoke(Method.java:566)
                at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
                at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
                at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1042)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1128)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
                at groovy.lang.Closure.call(Closure.java:427)
                at groovy.lang.Closure.call(Closure.java:406)
                at groovy.lang.Closure.run(Closure.java:498)
                ... 1 common frames omitted
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
        at io.grpc.Status.asRuntimeException(Status.java:537)
        ... 14 common frames omitted
Dec-05 21:11:39.901 [Task submitter] DEBUG nextflow.Session - Session aborted -- Cause: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway

Environment

NextFlow version 23.11.0-edge build 5891
openjdk version "11.0.19" 2023-04-18
OS: Ubuntu 18.04.6 LTS
shell: GNU bash, version 4.4.20(1)-release (x86_64-pc-linux-gnu)

NF config:

google {
    project = 'xxx'
    region = 'us-central1,us-east1,us-east4,us-west1,us-west2'
    batch.preemptible = true
    batch.bootDiskSize = '20 GB'
    batch.debug = true
}

executor {
    pollInterval = '5 sec'
    dumpInterval = '10 sec'
    submitRateLimit = '10 sec'
}

Some jobs for the same sample that succeeded: nf-6ee35e49-1701807374390, nf-6b52752f-1701807374422, nf-6a013421-1701806533653. As above, the log does not contain a nf- ID for the failing job. It would be great if this were caught and handled as an individual task-level failure, so it could be governed by retry policy.

adamg-dzd avatar Dec 05 '23 21:12 adamg-dzd

We missed this issue. I'd like to ask if this is still ongoing? If so, can you share some recent job UIDs and the region so we can investigate?

soj-hub avatar Jan 29 '24 14:01 soj-hub

I have not done any large runs recently, and it didn't happen. I guess we could close this issue for now, and I can reopen it if this happens again (and share UIDs at the same time)

Puumanamana avatar Jan 29 '24 15:01 Puumanamana

I reverted my executor to google-lifesciences for the time being, but will try google batch again once I've updated to NF v24 -- will keep you posted 🤞

adamg-dzd avatar Feb 08 '24 17:02 adamg-dzd

@bentsherman I think we should consider a try policy in this scenario. If I'm not wrong the Google SDK has a built-it capability for that

pditommaso avatar Feb 09 '24 16:02 pditommaso

@soj-hub, this error occurred again today (job UID: nf-3c8dd1e6-171027-ba6f08ab-72f4-49390):

Mar-13 05:13:50.241 [Task submitter] INFO  nextflow.Session - [7c/8b3ddc] Submitted process > ROME_QUANT_HS:L1EM_QUANT (SRR11582419)
Mar-13 05:13:50.374 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=ROME_QUANT_HS:ALUEM_QUANT (SRR6761453); work-dir=gs://nxf-work/romequant/cedric/3c/8dd1e638c191fc08f1ce800c1324ea
  error [com.google.api.gax.rpc.UnavailableException]: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
Mar-13 05:13:50.396 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ROME_QUANT_HS:ALUEM_QUANT (SRR6761453)'

Caused by:
  UNAVAILABLE: 502:Bad Gateway

com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: 502:Bad Gateway
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:112)
        at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:41)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:86)
        at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
        at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1127)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:568)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:538)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:570)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:567)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:71)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:735)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:716)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
        Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
                at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57)
                at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112)
                at com.google.cloud.batch.v1.BatchServiceClient.getJob(BatchServiceClient.java:427)
                at com.google.cloud.batch.v1.BatchServiceClient.getJob(BatchServiceClient.java:373)
                at nextflow.cloud.google.batch.client.BatchClient.describeJob(BatchClient.groovy:91)
                at nextflow.cloud.google.batch.client.BatchClient.getJobStatus(BatchClient.groovy:107)
                at nextflow.cloud.google.batch.GoogleBatchTaskHandler.getJobState(GoogleBatchTaskHandler.groovy:386)
                at nextflow.cloud.google.batch.GoogleBatchTaskHandler.checkIfCompleted(GoogleBatchTaskHandler.groovy:421)
                at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:615)
                at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:537)
                at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:412)
                at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.base/java.lang.reflect.Method.invoke(Method.java:566)
                at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
                at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
                at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
                at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
                at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
                at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
                at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
                at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:293)
                at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
                at groovy.lang.Closure.run(Closure.java:505)
                ... 1 common frames omitted

Puumanamana avatar Mar 13 '24 13:03 Puumanamana

@marcodelapierre think we need to schedule this as an improvement

pditommaso avatar Mar 14 '24 08:03 pditommaso

Not sure if this related or not, but when resuming the workflow above, it got interrupted again with a different error:

com.google.api.client.http.HttpResponseException: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/7f/52cac7d420e6b21872b35382d39bf2/.command.sh&uploadType=resumable
Service Unavailable

Puumanamana avatar Mar 14 '24 20:03 Puumanamana

I've added a retry policy to handle this error condition. Likely solved c4981dcc.

pditommaso avatar Mar 17 '24 17:03 pditommaso

Hi, It happened again on a large workflow run (~50k samples) with nextflow version 24.03.0-edge build 5908. Here's the relevant log section:

May-12 14:43:58.393 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 2397; name: ROME_QUANT_HS:FASTERQ_DUMP (SRR21916663); status: COMPLETED; exit: 0; error: -; workDir: gs://nxf-work/romequant/cedric/08/19>
May-12 14:43:58.462 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=ROME_QUANT_HS:FASTERQ_DUMP (SRR3709490); work-dir=gs://nxf-work/romequant/cedric/83/dff585c2550590ed6cb788eb6e1963
  error [com.google.api.client.http.HttpResponseException]: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable
May-12 14:43:58.470 [Task submitter] ERROR nextflow.processor.TaskProcessor - Error executing process > 'ROME_QUANT_HS:FASTERQ_DUMP (SRR3709490)'

Caused by:
  503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable

com.google.api.client.http.HttpResponseException: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable
        at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
        at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118)
        at com.google.cloud.storage.spi.v1.HttpStorageRpc.open(HttpStorageRpc.java:1000)
        at com.google.cloud.storage.ResumableMedia.lambda$null$0(ResumableMedia.java:37)
        at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
        at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
        at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
        at com.google.cloud.storage.Retrying.run(Retrying.java:54)
        at com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$1(ResumableMedia.java:34)
        at com.google.cloud.storage.BlobWriteChannel$Builder.build(BlobWriteChannel.java:315)
        at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:609)
        at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:579)
        at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:101)
        at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newWriteChannel(CloudStorageFileSystemProvider.java:467)
        at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newByteChannel(CloudStorageFileSystemProvider.java:299)
        at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:484)
        at java.base/java.nio.file.Files.newOutputStream(Files.java:228)
        at java.base/java.nio.file.Files.newBufferedWriter(Files.java:3008)
        at java.base/java.nio.file.Files.newBufferedWriter(Files.java:3056)
        at nextflow.executor.BashWrapperBuilder.write0(BashWrapperBuilder.groovy:445)
        at nextflow.executor.BashWrapperBuilder.build(BashWrapperBuilder.groovy:425)
        at nextflow.cloud.google.batch.GoogleBatchTaskHandler.submit(GoogleBatchTaskHandler.groovy:151)
        at nextflow.processor.TaskPollingMonitor.submit(TaskPollingMonitor.groovy:196)
        at nextflow.processor.TaskPollingMonitor.submitPendingTasks(TaskPollingMonitor.groovy:565)
        at nextflow.processor.TaskPollingMonitor.submitLoop(TaskPollingMonitor.groovy:390)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
        at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
        at groovy.lang.MetaClassImpl.invokeMethodClosure(MetaClassImpl.java:1017)
        at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1207)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
        at groovy.lang.Closure.call(Closure.java:433)
        at groovy.lang.Closure.call(Closure.java:412)
        at groovy.lang.Closure.run(Closure.java:505)
        at java.base/java.lang.Thread.run(Thread.java:840)
May-12 14:43:58.479 [Task submitter] DEBUG nextflow.Session - Session aborted -- Cause: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable
May-12 14:43:58.507 [Task submitter] DEBUG nextflow.Session - The following nodes are still active:

Puumanamana avatar May 12 '24 14:05 Puumanamana

Of note, we also observed the 503 error as well on run of 80k samples. I can provide additional information if needed.

vforget avatar May 13 '24 02:05 vforget

If you could provide the .nexflow.log file it would help

pditommaso avatar May 13 '24 07:05 pditommaso

Here's the full log file for me: nextflow.log

Puumanamana avatar May 13 '24 12:05 Puumanamana

We added retry logic to the google batch client when managing tasks, but now the error still happens when writing the wrapper script:

com.google.api.client.http.HttpResponseException: 503 Service Unavailable
POST https://storage.googleapis.com/upload/storage/v1/b/nxf-work/o?name=romequant/cedric/83/dff585c2550590ed6cb788eb6e1963/.command.sh&uploadType=resumable
Service Unavailable
	at com.google.api.client.http.HttpResponseException$Builder.build(HttpResponseException.java:293)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1118)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.open(HttpStorageRpc.java:1000)
	at com.google.cloud.storage.ResumableMedia.lambda$null$0(ResumableMedia.java:37)
	at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
	at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
	at com.google.cloud.storage.Retrying.run(Retrying.java:54)
	at com.google.cloud.storage.ResumableMedia.lambda$startUploadForBlobInfo$1(ResumableMedia.java:34)
	at com.google.cloud.storage.BlobWriteChannel$Builder.build(BlobWriteChannel.java:315)
	at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:609)
	at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:579)
	at com.google.cloud.storage.StorageImpl.writer(StorageImpl.java:101)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newWriteChannel(CloudStorageFileSystemProvider.java:467)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newByteChannel(CloudStorageFileSystemProvider.java:299)
	at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:484)
	at java.base/java.nio.file.Files.newOutputStream(Files.java:228)
	at java.base/java.nio.file.Files.newBufferedWriter(Files.java:3008)
	at java.base/java.nio.file.Files.newBufferedWriter(Files.java:3056)
	at nextflow.executor.BashWrapperBuilder.write0(BashWrapperBuilder.groovy:445)
	at nextflow.executor.BashWrapperBuilder.build(BashWrapperBuilder.groovy:425)
	at nextflow.cloud.google.batch.GoogleBatchTaskHandler.submit(GoogleBatchTaskHandler.groovy:151)

So we also need to handle this error here:

https://github.com/nextflow-io/nextflow/blob/0863a65a3175721b7d929d7e991b4b227706fdef/modules/nextflow/src/main/groovy/nextflow/executor/BashWrapperBuilder.groovy#L446-L462

bentsherman avatar May 13 '24 13:05 bentsherman