[Azure Batch]: TimeoutException when Nextflow attempts to read files following task completion for large-scale analyses in Azure

Bug report

We are using Nextflow with Azure Batch to process collections of microbial genomes (whole-genome sequence data). We have begun testing the processing of larger collections of genomes and have been encountering issues where some of the tasks run by Nextflow fail with a java.util.concurrent.TimeoutException. This primarily occurs when attempting to read the .exitcode file for a task from blob storage, which causes Nextflow to fail the task and return Integer.MAX_VALUE as the exit code. For example (see below for more context in the error message):

Jun... [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Cannot read exit status for task: `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` | java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
...
Jun... [TaskFinalizer-10] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...); work-dir=az://...
  error [nextflow.exception.ProcessFailedException]: Process `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` terminated for an unknown reason -- Likely it has been terminated by the external system

This behavior occurs only when scaling up to many genomes, and only impacts some of our runs. It also seems to affect random tasks/processes (in the case above it is QUAST, but it occurs for random processes across the full pipeline). The pipeline we are using is https://github.com/phac-nml/mikrokondo/. I have observed it in other pipelines, but much less frequently. I believe it shows up in this pipeline because it does a lot of processing and may take up to 2 hours to process a genome.

Expected behavior and actual behavior

I would expect all of our pipeline executions to complete successfully, with no TimeoutExceptions when reading the outputs of a task (e.g., the .exitcode file in Azure blob storage).

Steps to reproduce the problem

As this issue occurs mainly with large-scale analysis of genomes within Azure, and does not happen every time, it is difficult to provide a specific set of steps to reproduce it. However, here is a rough sketch:

  1. Use pipeline: https://github.com/phac-nml/mikrokondo (version 0.2.1)
  2. Run multiple instances of this pipeline using Azure Batch, perhaps 10 to 20, one instance per microbial genome (whole-genome sequences). A configuration sketch (with placeholder values) follows this list.
    1. The rate I have observed is that ~10% of the runs will fail with the below exception.
    2. It is important to use real-world sized genomes (e.g., from NCBI). The specific data I am testing with is a selection of Illumina paired-end sequence reads of Salmonella genomes from BioProject PRJNA305824.
  3. Monitor the runs. Roughly 10% will fail with the below timeout exceptions, and a failure may take 2 hours or more to appear. I have attempted to test on smaller datasets that run quickly, but it is much harder to reproduce the failures in that case.
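To give an idea of how these runs are launched, here is a minimal sketch of the kind of Nextflow configuration involved, assuming the built-in Azure Batch executor with auto pool creation. The account names, keys, container, and region are placeholders rather than our actual values:

    // nextflow.config sketch (placeholder values only)
    process.executor = 'azurebatch'
    workDir = 'az://work-container/work'                 // task work dirs live in blob storage

    azure {
        storage {
            accountName = '<storage-account>'            // placeholder
            accountKey  = '<storage-account-key>'        // placeholder
        }
        batch {
            location                = 'canadacentral'    // placeholder region
            accountName             = '<batch-account>'  // placeholder
            accountKey              = '<batch-account-key>' // placeholder
            autoPoolMode            = true               // let Nextflow create pools
            allowPoolCreation       = true
            deletePoolsOnCompletion = true
        }
    }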

Program output

I am unfortunately unable to share the full nextflow.log, but here is the relevant section.

Jun... [reactor-http-epoll-2] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
	at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
	at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)
Jun... [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Cannot read exit status for task: `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` | java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
Jun... [pool-11-thread-1] INFO  c.m.a.adal4j.AuthenticationAuthority - [Correlation ID: ...] Instance discovery was successful
Jun... [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 18; name: MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...); status: COMPLETED; exit: -; error: -; workDir: az://...]
Jun... [TaskFinalizer-10] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...); work-dir=az://...
  error [nextflow.exception.ProcessFailedException]: Process `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` terminated for an unknown reason -- Likely it has been terminated by the external system

What I have observed is that the below exception prints out 3 times, and then on the fourth occurrence the task fails:

Jun... [reactor-http-epoll-2] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
	at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
	at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)

Note that the default timeout is 60,000 milliseconds (1 minute), but we modified it to be 600,000 milliseconds (10 minutes) by adjusting the AZURE_REQUEST_RESPONSE_TIMEOUT environment variable as described here: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/core/azure-core/README.md#http-timeouts. This did not fix the problem, though; it only increased the time before failure.
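For what it's worth, the override has to be visible to the JVM running Nextflow itself (where the Azure SDK clients live), not to the task containers. A small sketch, assuming azure-core is on the classpath, for double-checking from that JVM that the SDK's global configuration actually picks the value up:

    import com.azure.core.util.Configuration

    // Sketch only: reads the response-timeout override back out of the Azure SDK's
    // global configuration to confirm it is visible to the head-process JVM.
    def timeout = Configuration.getGlobalConfiguration().get('AZURE_REQUEST_RESPONSE_TIMEOUT')
    println "AZURE_REQUEST_RESPONSE_TIMEOUT = ${timeout}"   // expect 600000 with the override above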

Environment

  • Nextflow version: 24.04.2 build 5914
  • Java version: Groovy 4.0.21 on OpenJDK 64-Bit Server VM 17.0.10+7-LTS
  • Operating system: Linux 5.10.102.2-microsoft-standard
  • Bash version: [Unsure. I will update when I can get the bash version]

Additional context

We have spent a bit of time trying to identify the cause of this issue, but haven't been able to isolate and address it. However, we have some additional information on the locations in code where this is happening. I'm providing this information in case it is useful, but please feel free to skip the below (I might be wrong in some of this too).

1. TimeoutException failing a task

The TimeoutException that finally triggers the task failure is logged on the following line:

Jun... [Task monitor] DEBUG n.c.azure.batch.AzBatchTaskHandler - [AZURE BATCH] Cannot read exit status for task: `MIKROKONDO:POST_ASSEMBLY:QC_ASSEMBLY:QUAST (...)` | java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.

This occurs in this section of Nextflow code (the AzBatchTaskHandler):

https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchTaskHandler.groovy#L170-L178

That is, it is attempting to read the exit status via the exitFile.text property. The exitFile is of type AzPath, which implements java.nio.file.Path.
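Paraphrased (this is not the exact Nextflow source; the class and method names below are illustrative), the pattern at the linked lines is roughly the following, which is where the Integer.MAX_VALUE exit code comes from when the read throws:

    import groovy.util.logging.Slf4j
    import java.nio.file.Path

    // Rough paraphrase of the exit-status read in AzBatchTaskHandler (see link above);
    // names here are illustrative, and slf4j is assumed to be on the classpath.
    @Slf4j
    class ExitStatusReader {
        static int readExitStatus(Path exitFile, String taskName) {
            try {
                // exitFile is an AzPath (az://...), so .text goes through the
                // nf-azure NIO provider and ends up downloading the blob
                return exitFile.text.trim() as Integer
            }
            catch( Exception e ) {
                log.debug "[AZURE BATCH] Cannot read exit status for task: `${taskName}` | ${e.message}"
                return Integer.MAX_VALUE
            }
        }
    }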

2. Reading from Path.text in Azure

I am not as familiar with how Groovy handles Path.text for reading and returning the contents of a file as text, but in Nextflow with Azure I think it ultimately runs this bit of code in AzFileSystem to open a stream to the file in blob storage:

https://github.com/nextflow-io/nextflow/blob/e6a5e17db91ca6fae875df42053bd8c1b2a23bd4/plugins/nf-azure/src/main/nextflow/cloud/azure/nio/AzFileSystem.groovy#L201-L217
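My (possibly wrong) mental model is that Groovy's Path.text extension boils down to something like the sketch below: the stream is opened through the path's NIO provider, which for az:// paths is the nf-azure provider backed by AzFileSystem, so the open becomes a blob operation. Groovy's actual NioGroovyMethods implementation differs in detail:

    import java.nio.file.Files
    import java.nio.file.Path

    // Simplified illustration of the delegation chain behind Path.text
    String readText(Path path) {
        // Files.newInputStream() delegates to path.getFileSystem().provider().newInputStream(path),
        // which for az:// paths is implemented by the nf-azure provider / AzFileSystem
        Files.newInputStream(path).withStream { InputStream stream ->
            stream.getText('UTF-8')
        }
    }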

3. BlobClient.openInputStream()

Line 208 above runs client.openInputStream(). The client object is of type BlobClient, which extends BlobClientBase, and openInputStream() is defined here in the Azure Java SDK:

    /**
     * Opens a blob input stream to download the blob.
     *
     * @return An <code>InputStream</code> object that represents the stream to use for reading from the blob.
     * @throws BlobStorageException If a storage service error occurred.
     */
    public BlobInputStream openInputStream() {
        return openInputStream((BlobRange) null, null);
    }

https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/storage/azure-storage-blob/src/main/java/com/azure/storage/blob/specialized/BlobClientBase.java#L292-L300

4. Azure Java SDK HTTP Pipeline

The Azure Java SDK seems to use an HTTP Pipeline for defining different steps to take when handling an API request to Azure: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#http-pipeline

In particular, there are configurable retry policies that are part of the Azure SDK: https://learn.microsoft.com/en-us/azure/developer/java/sdk/http-client-pipeline#common-http-pipeline-policies
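For illustration only (I am not claiming this is where or how nf-azure builds its storage clients), the storage SDK lets you set its own retry behaviour and per-try timeout when a client is constructed, via RequestRetryOptions on the BlobServiceClientBuilder. The endpoint and credentials below are placeholders:

    import com.azure.storage.blob.BlobServiceClient
    import com.azure.storage.blob.BlobServiceClientBuilder
    import com.azure.storage.common.StorageSharedKeyCredential
    import com.azure.storage.common.policy.RequestRetryOptions
    import com.azure.storage.common.policy.RetryPolicyType

    // Illustration only: exponential back-off, 4 tries, 120-second per-try timeout
    def retryOptions = new RequestRetryOptions(
            RetryPolicyType.EXPONENTIAL,  // back-off strategy
            4,                            // maxTries
            120,                          // tryTimeoutInSeconds (per attempt)
            null,                         // retryDelayInMs (SDK default)
            null,                         // maxRetryDelayInMs (SDK default)
            null)                         // secondaryHost (none)

    BlobServiceClient client = new BlobServiceClientBuilder()
            .endpoint('https://<storage-account>.blob.core.windows.net')            // placeholder
            .credential(new StorageSharedKeyCredential('myaccount', 'bXktYWNjb3VudC1rZXk=')) // dummy placeholder credentials
            .retryOptions(retryOptions)
            .buildClient()

If I understand the defaults correctly, RequestRetryOptions defaults to 4 tries, which might line up with the pattern above of three warnings followed by a failure on the fourth attempt, though I have not confirmed that this is the policy actually in effect for the nf-azure blob client.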

5. Azure/HTTP Pipeline retry policies

This brings us to the other part of the exception in the Nextflow log files:

Jun... [reactor-http-epoll-2] WARN  r.n.http.client.HttpClientConnect - [..., L:... - R:...] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 600000 milliseconds.
	at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:202)
	at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:187)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)

This appears to be triggered by this code:

    void responseTimedOut(ChannelHandlerContext ctx) {
        if (!closed) {
            disposeResponseTimeoutWatcher();
            ctx.fireExceptionCaught(
                new TimeoutException("Channel response timed out after " + responseTimeoutMillis + " milliseconds."));
            ctx.close();
            closed = true;
        }
    }

https://github.com/Azure/azure-sdk-for-java/blob/421555531b3e83a5df3ca605653c46f8c9c7d6de/sdk/core/azure-core-http-netty/src/main/java/com/azure/core/http/netty/implementation/AzureSdkHandler.java#L199-L207

6. Nextflow retry policies

I do know there are azure.retryPolicy.* settings that can be adjusted in the Nextflow config: https://www.nextflow.io/docs/latest/config.html#config-azure. We have tried adjusting them, but my guess is that when reading the .exitcode file from Azure blob storage these policies are not applied, so the defaults configured within the Azure SDK are used instead. I'm not sure whether this is expected behaviour or whether it could be contributing to the issue.
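For reference, these are the settings from the linked azure scope documentation; the values below are illustrative placeholders rather than the exact ones we used:

    // nextflow.config sketch; values are illustrative only
    azure {
        retryPolicy {
            maxAttempts = 10        // max retry attempts
            delay       = '500ms'   // initial delay between attempts
            maxDelay    = '90s'     // upper bound on the delay
            jitter      = 0.25      // random jitter applied to the delay
        }
    }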

Also, some of the above description of the code might be incorrect. I have been running up against my own lack of knowledge of how everything works, which makes it hard to debug.

apetkau · Jun 14 '24 17:06