
Failed to evaluate job outputs - IOException: Could not read from s3...

doron-st opened this issue 5 years ago · 23 comments

While testing cromwell-36 with AWS batch I was able to reproduce this error:

2019-02-25 09:38:52,508 cromwell-system-akka.dispatchers.engine-dispatcher-24 ERROR - WorkflowManagerActor Workflow b6b9322c-3929-4b72-9598-45d97dfb858d failed (during ExecutingWorkflowState): cromwell.backend.standard.StandardAsyncExecutionActor$$anon$2: Failed to evaluate job outputs:
Bad output 'print_nach_nachman_meuman.out': [Attempted 1 time(s)] - IOException: Could not read from s3://nrglab-cromwell-genomics/cromwell-execution/run_multiple_tests/b6b9322c-3929-4b72-9598-45d97dfb858d/call-test_cromwell_on_aws/shard-61/SingleTest.test_cromwell_on_aws/f8ecf673-ed61-4b06-b1d6-c20f7efe986e/call-print_nach_nachman_meuman/print_nach_nachman_meuman-stdout.log: Cannot access file: s3://s3.amazonaws.com/nrglab-cromwell-genomics/cromwell-execution/run_multiple_tests/b6b9322c-3929-4b72-9598-45d97dfb858d/call-test_cromwell_on_aws/shard-61/SingleTest.test_cromwell_on_aws/f8ecf673-ed61-4b06-b1d6-c20f7efe986e/call-print_nach_nachman_meuman/print_nach_nachman_meuman-stdout.log
        at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionSuccess$1(StandardAsyncExecutionActor.scala:867)

The error occurs when running many sub-workflows within a single wrapping workflow. The environment is configured correctly, and the test usually passes when running fewer than ~30 sub-workflows.
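Incidentally, the second path in the stack trace has the S3 endpoint host folded into the bucket position (`s3://s3.amazonaws.com/<bucket>/...` instead of `s3://<bucket>/...`). A minimal sketch of a normalizer for that malformed form (hypothetical helper, not part of Cromwell):

```python
from urllib.parse import urlparse

def normalize_s3_uri(uri: str) -> str:
    """Strip an accidental S3 endpoint host from the authority of an s3:// URI.

    Cromwell's error above shows "s3://s3.amazonaws.com/<bucket>/<key>";
    the real bucket is then the first path segment, not the host.
    """
    parsed = urlparse(uri)
    if parsed.scheme == "s3" and parsed.netloc.endswith("amazonaws.com"):
        # Host slipped into the bucket slot; recover bucket and key from the path.
        bucket, _, key = parsed.path.lstrip("/").partition("/")
        return f"s3://{bucket}/{key}"
    return uri

print(normalize_s3_uri(
    "s3://s3.amazonaws.com/nrglab-cromwell-genomics/cromwell-execution/stdout.log"))
# -> s3://nrglab-cromwell-genomics/cromwell-execution/stdout.log
```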

Here are the workflows:

run_multiple_test.wdl

import "three_task_sequence.wdl" as SingleTest

workflow run_multiple_tests {
    scatter (i in range(30)) {
        call SingleTest.three_task_sequence
    }
}

three_task_sequence.wdl

workflow three_task_sequence {
    call print_nach

    call print_nach_nachman {
        input:
            previous = print_nach.out
    }

    call print_nach_nachman_meuman {
        input:
            previous = print_nach_nachman.out
    }

    output {
        Array[String] out = print_nach_nachman_meuman.out
    }
}

task print_nach {
    command {
        echo "nach"
    }
    output {
        Array[String] out = read_lines(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
        maxRetries: 3
    }
}

task print_nach_nachman {
    Array[String] previous

    command {
        echo ${sep=' ' previous} " nachman"
    }
    output {
        Array[String] out = read_lines(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
        maxRetries: 3
    }
}

task print_nach_nachman_meuman {
    Array[String] previous

    command {
        echo ${sep=' ' previous} " meuman"
    }
    output {
        Array[String] out = read_lines(stdout())
    }
    runtime {
        docker: "ubuntu:latest"
        maxRetries: 3
    }
}

Here is the cromwell-conf:

// aws.conf
include required(classpath("application"))

webservice {
  port = 8001
  interface = 0.0.0.0
}

aws {
  application-name = "cromwell"
  auths = [{
      name = "default"
      scheme = "default"
  }]
  region = "us-east-1"
}

engine {
  filesystems {
    s3 { auth = "default" }
  }
}

backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        root = "s3://nrglab-cromwell-genomics/cromwell-execution"
        auth = "default"

        numSubmitAttempts = 3
        numCreateDefinitionAttempts = 3

        concurrent-job-limit = 100

        default-runtime-attributes {
          queueArn: "arn:aws:batch:us-east-1:66:job-queue/GenomicsDefaultQueue"
        }

        filesystems {
          s3 {
            auth = "default"
          }
        }
      }
    }
  }
}

system {
  job-rate-control {
    jobs = 1
    per = 1 second
  }
}
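Not part of the config above, but since the failure is a transient S3 read, raising Cromwell's I/O retry settings may help. These keys appear in Cromwell's example configuration (`cromwell.examples.conf`); the values here are illustrative, so check the `reference.conf` for your version before relying on them:

```hocon
system {
  io {
    # Throttle for outgoing I/O requests.
    number-of-requests = 100000
    per = 100 seconds
    # Number of times an I/O operation is attempted before failing the job.
    number-of-attempts = 5
  }
}
```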

Would appreciate help on this. Was Cromwell ever tested with many parallel sub-workflows running on AWS?

Thanks!

doron-st avatar Feb 28 '19 13:02 doron-st

Hey, did you ever manage to get a workaround for this error?

caaespin avatar Jul 25 '19 07:07 caaespin

@caaespin I'm assuming that means you still see this. Are you using a recent Cromwell version? (42+)

geoffjentry avatar Jul 25 '19 07:07 geoffjentry

@geoffjentry yes. My current deployment is v42.

If you have access to the GATK forums, i put more details in my post there: https://gatkforums.broadinstitute.org/wdl/discussion/24268/aws-batch-randomly-fails-when-running-multiple-workflows/p1?new=1

caaespin avatar Jul 25 '19 07:07 caaespin

+1, I have a similar error.

marpiech avatar Aug 01 '19 08:08 marpiech

@geoffjentry From inspecting the logs and the AWS Batch console, I think what is happening is that the jobs fail because Cromwell shuts down the VMs earlier than expected. One of the shards hasn't finished and is unable to upload to S3, which is when the problem occurs. This is just a hypothesis based on what I saw, but hopefully it's helpful.

caaespin avatar Aug 18 '19 18:08 caaespin

@geoffjentry Any movement on this? I'm having this same issue sporadically (v48 + AWS backend) with workflows that contain large scatter operations.

alexwaldrop avatar Feb 04 '20 21:02 alexwaldrop

@alexwaldrop NB that I don't work there anymore and sadly haven't had the energy to actively contribute. Perhaps @aednichols can chime in

geoffjentry avatar Feb 04 '20 22:02 geoffjentry

I am having the same error with the "Using Data on S3" example from https://docs.opendata.aws/genomics-workflows/orchestration/cromwell/cromwell-examples/. I changed the S3 bucket name in the .json file to my bucket name, but the run still failed with the same error message. I am using cromwell-48. The S3 bucket allows all public access, and I was logged in as Admin in two terminal windows, one running the server and the other submitting the job. The previous two hello-world examples were successful. There is no log file in the bucket, and in cromwell-execution the only file created was the script; no rc, stderr, or stdout was created.

blindmouse avatar Mar 11 '20 20:03 blindmouse

> I am having the same error with the example "Using Data on S3" [...] There is no rc or stderr or stdout created.

@blindmouse Were you able to resolve your issue? I am encountering the same problem. Thanks.

sripaladugu avatar Jul 21 '20 20:07 sripaladugu

This can happen if the job fails meaning that an rc.txt file isn’t created. It would be worth looking at the CloudWatch log for the batch job.
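A sketch of how to pull that CloudWatch log with the AWS CLI, assuming AWS Batch's default `/aws/batch/job` log group; the job id below is a placeholder for the one Cromwell logged for the failed call:

```shell
# Placeholder Batch job id; substitute the id from the Cromwell log / Batch console.
JOB_ID="example-batch-job-id"

# Find the CloudWatch log stream that AWS Batch attached to the job.
STREAM=$(aws batch describe-jobs --jobs "$JOB_ID" \
  --query 'jobs[0].container.logStreamName' --output text)

# Dump the job's log events (Batch writes to /aws/batch/job by default).
aws logs get-log-events \
  --log-group-name /aws/batch/job \
  --log-stream-name "$STREAM" \
  --query 'events[].message' \
  --output text
```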

On Tue, Jul 21, 2020 at 4:07 PM Sri Paladugu [email protected] wrote:

Is there any progress on this issue? I am getting the following exception: IOException: Could not read from s3:///results/ReadFile/5fec5c4a-2e3f-49ed-8f9e-6d9d2d759449/call-read_file/read_file-rc.txt Caused by: java.nio.file.NoSuchFileException: s3://s3.amazonaws.com/s3bucketname/results/ReadFile/5fec5c4a-2e3f-49ed-8f9e-6d9d2d759449/call-read_file/read_file-rc.txt


markjschreiber avatar Jul 21 '20 23:07 markjschreiber

> This can happen if the job fails meaning that an rc.txt file isn't created. It would be worth looking at the CloudWatch log for the batch job.

Cloudwatch logs contained the following message: "/bin/bash: /var/scratch/fetch_and_run.sh: Is a directory"

sripaladugu avatar Jul 22 '20 00:07 sripaladugu

It may be that you're running Cromwell 52 or later with infrastructure built from an older AWS CloudFormation template. Can you share which build of Cromwell you're using and the build/version/origin of the CloudFormation template?


markjschreiber avatar Aug 08 '20 22:08 markjschreiber

Hi @markjschreiber I'm also running into this error. I am using cromwell 53 with a custom cdk stack based on the CloudFormation infrastructure described here: https://docs.opendata.aws/genomics-workflows/

Are modifications needed for compatibility with newer versions of Cromwell? Are these documented somewhere?

mderan-da avatar Sep 09 '20 13:09 mderan-da

Attached is some documentation that works for v52 and should work for v53


markjschreiber avatar Sep 11 '20 21:09 markjschreiber

Hi @markjschreiber Thanks but it looks like the attachment didn't come through.

mderan-da avatar Sep 12 '20 19:09 mderan-da

@markjschreiber I'm running into the same error for both v52 and v53.1. I am using the same CloudFormation template @mderan-da mentioned. Would appreciate your newer documentation on this.

yaomin avatar Sep 13 '20 20:09 yaomin

Documentation can be downloaded from here https://cromwell-share-ad485.s3.us-east-2.amazonaws.com/InstallingGenomicsWorkflowCoreWithCromwel52.pdf


markjschreiber avatar Sep 14 '20 14:09 markjschreiber

> Cloudwatch logs contained the following message: "/bin/bash: /var/scratch/fetch_and_run.sh: Is a directory"

Also have this error. Anyone figure out what the issue is?

dfeinzeig avatar Nov 14 '20 18:11 dfeinzeig

Also have this error, using Cromwell 52, installed using this manual:

https://aws-genomics-workflows.s3.amazonaws.com/Installing+the+Genomics+Workflow+Core+and+Cromwell.pdf

The logs say: fetch_and_run.sh is a directory.

geertvandeweyer avatar Dec 17 '20 17:12 geertvandeweyer


Extra info: cloning the job and resubmitting it through the AWS console runs fine, so it seems to be a transient issue.

geertvandeweyer avatar Dec 17 '20 18:12 geertvandeweyer

Hmmm, still stuck on this. Any updates on your end? I tried cloning and resubmitting and am still getting the same error.

sscho avatar May 13 '21 22:05 sscho

Still getting this error today.

ptdtan avatar Jun 08 '21 11:06 ptdtan

I get this error almost every time I run workflows that scatter more samples than usual (e.g. 96). Cromwell version: 60-6048d0e-SNAP.

Is there a workaround to this?

alimayy avatar Sep 12 '22 22:09 alimayy