
Workflow running out of memory

Open spitfiredd opened this issue 1 year ago • 5 comments

Describe the Bug

Worker processes are not being spawned with enough memory and are not scaling; as a result, Nextflow fails with exit status 137 (out of memory).

Steps to Reproduce

name: foo
schemaVersion: 1
workflows:
  foo:
    type:
      language: nextflow
      version: dsl2
    sourceURL: workflows/foo
contexts:
  dev:
    instanceTypes:
      - "r5.large"
    engines:
      - type: nextflow
        engine: nextflow

Child processes are spawning with 1 vCPU and 1024 MiB of memory.

Relevant Logs

Main Process

2022-11-17T14:00:01.866-08:00	Version: 22.04.3 build 5703
2022-11-17T14:00:01.866-08:00	Created: 18-05-2022 19:22 UTC
2022-11-17T14:00:01.866-08:00	System: Linux 4.14.294-220.533.amzn2.x86_64
2022-11-17T14:00:01.866-08:00	Runtime: Groovy 3.0.10 on OpenJDK 64-Bit Server VM 11.0.16.1+9-LTS
2022-11-17T14:00:01.866-08:00	Encoding: UTF-8 (ANSI_X3.4-1968)
2022-11-17T14:00:01.866-08:00	Process: [email protected] [redacted]
2022-11-17T14:00:01.866-08:00	CPUs: 2 - Mem: 2 GB (1.5 GB) - Swap: 2 GB (2 GB)
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:57.780 [main] WARN com.amazonaws.util.Base64 - JAXB is unavailable. Will fallback to SDK implementation which may be less performant.If you are using Java 9+, you will need to include javax.xml.bind:jaxb-api as a dependency.
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:57.799 [main] DEBUG nextflow.file.FileHelper - Can't check if specified path is NFS (1): redacted
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:57.799 [main] DEBUG nextflow.Session - Work-dir: redacted
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:57.799 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /root/.nextflow/assets/redacted/bin
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:57.871 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[AwsBatchExecutor]
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:57.886 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:57.954 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:57.975 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 3; maxThreads: 1000
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:58.123 [main] DEBUG nextflow.Session - Session start invoked
2022-11-17T14:00:01.866-08:00	Nov-17 21:53:59.049 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution

Child Process

2022-11-17T14:00:01.867-08:00	Essential container in task exited - OutOfMemoryError: Container killed due to memory usage
2022-11-17T14:00:01.867-08:00	Command executed:
2022-11-17T14:00:01.867-08:00	fastp     -i USDA_soil_C35-5-1_1.fastq.gz     -I USDA_soil_C35-5-1_2.fastq.gz     -o "USDA_soil_C35-5-1.trim.R1.fq.gz"     -O "USDA_soil_C35-5-1.trim.R2.fq.gz"     --length_required 50     -h "USDA_soil_C35-5-1.html"     -w 16
2022-11-17T14:00:01.867-08:00	Command exit status:
2022-11-17T14:00:01.867-08:00	137
2022-11-17T14:00:01.867-08:00	Command output:
2022-11-17T14:00:01.867-08:00	(empty)
2022-11-17T14:00:01.867-08:00	Command error:
2022-11-17T14:00:01.867-08:00	  .command.sh: line 2:   188 Killed                  fastp -i USDA_soil_C35-5-1_1.fastq.gz -I USDA_soil_C35-5-1_2.fastq.gz -o "USDA_soil_C35-5-1.trim.R1.fq.gz" -O "USDA_soil_C35-5-1.trim.R2.fq.gz" --length_required 50 -h "USDA_soil_C35-5-1.html" -w 16

Expected Behavior

Spawn child processes with enough memory for their tasks, or scale resources to meet demand.

Actual Behavior

The container ran out of memory and was killed (exit status 137).

Additional Context

Ran the workflow with the following command: agc workflow run foo --context dev

Operating System: Linux
AGC Version: 1.5.1
Was AGC set up with a custom bucket: no
Was AGC set up with a custom VPC: no

spitfiredd avatar Nov 17 '22 22:11 spitfiredd

I am seeing similar behavior with Cromwell. I give a task 64 GB, but in AWS Batch I see the following warning next to the memory information:

Configuration conflict
This value was submitted using containerOverrides.memory which has been deprecated and was not used as an override. Instead, the MEMORY value found in the job definition’s resourceRequirements key was used instead. More information about the deprecated key can be found in the AWS Batch API documentation.

I see an "Essential container in task exited". However, when I click on the job definition. It appears to have 8GB allocated memory. Is there a different way to specify memory?

biofilos avatar Nov 28 '22 07:11 biofilos

Thanks for reporting this issue. Is this an issue with the 1.5.2 release as well?

vvalleru avatar Nov 28 '22 19:11 vvalleru

It is still an issue with v1.5.2 (Cromwell).

biofilos avatar Dec 01 '22 11:12 biofilos

@spitfiredd The child processes are spawned with a default of 1 vCPU and 1024 MiB of memory. If tasks need more memory or CPU, you would typically request them with the cpus (https://www.nextflow.io/docs/latest/process.html#cpus) and memory (https://www.nextflow.io/docs/latest/process.html#memory) process directives.
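
For example, the fastp step from the logs above could declare its requirements in the process definition. This is a minimal sketch: the process name, file names, and resource values are illustrative and should be sized to the real workload:

process FASTP {
    // Illustrative resource requests; they must fit on the instance
    // types allowed by the context (an r5.large has 2 vCPUs, 16 GiB).
    cpus 2
    memory '12 GB'

    input:
    tuple path(reads1), path(reads2)

    output:
    path "trim.*.fq.gz"

    script:
    """
    fastp -i ${reads1} -I ${reads2} -o trim.R1.fq.gz -O trim.R2.fq.gz -w ${task.cpus}
    """
}

Using ${task.cpus} for -w also keeps fastp's thread count in step with the CPUs actually granted to the container, rather than hard-coding -w 16 against a 1 vCPU allocation.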

markjschreiber avatar Dec 07 '22 15:12 markjschreiber

@biofilos AGC currently uses an older version of Cromwell, which makes the deprecated containerOverrides.memory call to AWS Batch, hence the warning you are seeing. In our next release we will update the version of Cromwell used.

As a possible workaround, you might consider deploying a miniwdl context to run the WDL.
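
Such a context could be added to the agc-project.yaml using the same syntax as the report above; a minimal sketch (the context name is illustrative):

contexts:
  miniwdlDev:
    engines:
      - type: wdl
        engine: miniwdl

After deploying it with agc context deploy, the workflow could be run with agc workflow run foo --context miniwdlDev.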

markjschreiber avatar Dec 07 '22 15:12 markjschreiber