
genome_updater and nextflow seems to work only partially

Open oliverdrechsel opened this issue 3 years ago • 5 comments

Dear @pirovc ,

we'd like to run genome_updater in a nextflow pipeline context. After many attempts, we're a bit stuck.

We generated a conda YAML file to set up the conda environment at runtime.

name: GenomeUpdaterEnv
channels:
  - bioconda
  - conda-forge
dependencies:
  - genome_updater=0.2.5

Downloading the first genome works as expected, but all the subsequent downloads fail with a bash error.

N E X T F L O W ~ version 21.04.0
Launching `genome_updater.nf` [prickly_becquerel] - revision: 06262faebb
executor > local (1)
[86/d475e5] process > genomeUpdater [ 0%] 0 of 1
Error executing process > 'genomeUpdater'

Caused by:
Process `genomeUpdater` terminated with an error exit status (55)

Command executed:

genome_updater.sh -d "refseq" -g "taxids:632" -c "all" -l "Complete Genome" -o "632_test" -f "genomic.fna.gz" -t 12

Command exit status:
55

Command output:
-------------------------------------------
┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐ ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐
│ ┬├┤ ││││ ││││├┤ │ │├─┘ ││├─┤ │ ├┤ ├┬┘
└─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴ ─┴┘┴ ┴ ┴ └─┘┴└─
v0.2.5
-------------------------------------------
Mode: NEW - DOWNLOAD
Working directory: 632_test/
-------------------------------------------
Downloading assembly summary [2021-12-14_10-04-20]
- 548 entries available
- 492 entries removed with filters: RefSeq category=all, Assembly level=Complete Genome, Version status=latest, valid URLs
- 56 entries to be downloaded
- Downloading 56 files with 12 threads
1/56 - 1.78%

Command error:
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found
/usr/bin/bash: download: command not found

If the command is repeated in a normal bash terminal, it runs perfectly fine.

Could you please point us to where we cause the error?

cheers Oli

oliverdrechsel avatar Dec 14 '21 09:12 oliverdrechsel

Hi @oliverdrechsel, that's quite awkward. download is the name of a function inside the script, and somehow the reference to it is being lost. Maybe it is related to the way the script is called? Do you know exactly which command nextflow uses to call it? I have no experience with nextflow, but we can try to figure it out. Could you try to run genome_updater.sh -D inside nextflow and paste the output here?

pirovc avatar Dec 14 '21 13:12 pirovc

Hi @pirovc ,

thanks for picking up the issue so fast. The nextflow script runs genome_updater as an isolated process that relies only on the YAML file mentioned earlier. The actual command that is executed is shown in the output above:

genome_updater.sh -d "refseq" -g "taxids:632" -c "all" -l "Complete Genome" -o "632_test" -f "genomic.fna.gz" -t 12

If we copy this command line to bash, it runs perfectly fine. Nextflow initiates a new subshell with #! /usr/bin/bash -ue. Do you think this may lead to issues?

We installed nextflow from conda so it should be the most recent version.

// Taxonomy IDs:
gentaxid = "632"
genome_ch = Channel.value(gentaxid)


process genomeUpdater{

    conda "genomeUpdaterEnv.yaml"


    input:
        val x from genome_ch

    output:
       path("*.fna.gz") into fasta_ch


    script:
        """
        genome_updater.sh \
            -d "refseq" \
            -g "taxids:${x}" \
            -c "all" \
            -l "Complete Genome" \
            -o "${x}_test" \
            -f "genomic.fna.gz" \
            -t 12
        """

}

fasta_ch.view()

oliverdrechsel avatar Dec 14 '21 14:12 oliverdrechsel

I found an issue very similar to what is happening to you (parallel calling a function and returning "command not found"): https://stackoverflow.com/a/27833407

The problem may be multiple bash versions that are incompatible with each other. You said that nextflow calls #! /usr/bin/bash, while genome_updater.sh uses #!/bin/bash, and I'm not sure which one parallel is calling. Maybe try setting SHELL=/bin/bash genome_updater.sh -d ... in your script call.
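
To illustrate the kind of failure described above, here is a minimal sketch, not taken from genome_updater itself. It assumes the function is made visible to child processes with bash's `export -f`; the `download` body and the file name `a.fna.gz` are hypothetical stand-ins. A child bash that inherits the exported function can call it, while a child bash started without that environment entry reports "command not found" with exit status 127, which is the same symptom as in the log.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a function defined inside a script.
download() { echo "downloading $1"; }

# Export the function so that child bash processes can see it.
export -f download

# A child bash that inherits the environment can call the function.
ok=$(bash -c 'download a.fna.gz' 2>&1)

# A child bash started with an empty environment has lost the function.
# The absolute path to bash is resolved before the environment is cleared.
status=0
missing=$(env -i "$(command -v bash)" -c 'download a.fna.gz' 2>&1) || status=$?

echo "with exported function:    $ok"
echo "without exported function: $missing (exit status $status)"
```

If a different or older bash ends up running the child command, the exported function may not be recognised either, which is one possible reason for trying the SHELL=/bin/bash setting suggested above.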

pirovc avatar Dec 14 '21 14:12 pirovc

You can also try the newest release (v0.3.0), where the shebang is now #!/usr/bin/env bash, which should be more portable.

pirovc avatar Jan 21 '22 17:01 pirovc

Hi, thanks for your update. We have now checked several things. Among others, we ran the tool in snakemake, reusing the same env files and script as far as possible, and it worked fine. We found that Nextflow seems to miss something while building the conda environment. We need to find out what it is and then account for it.

We'll keep you posted as soon as we find out what it is.
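
One way to narrow down what differs between the environments is to print the same shell-related facts in each of them (plain terminal, snakemake rule, nextflow process) and compare the results. The following is only a sketch: `report_shell_env` is a hypothetical helper name, and the choice of values to print is an assumption based on the points raised in this thread (which bash is used, what SHELL is set to, and whether parallel is on the PATH).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print shell-related facts for comparison
# between a terminal session and a workflow-managed process.
report_shell_env() {
    echo "SHELL=${SHELL:-unset}"                      # shell that child tools may pick up
    echo "bash_on_path=$(command -v bash)"            # first bash found on PATH
    echo "bash_version=${BASH_VERSION}"               # version of the bash running this script
    echo "parallel_on_path=$(command -v parallel || echo 'not found')"
}

report_shell_env
```

Running this once in the terminal and once at the top of the process script block gives two short listings that can be compared line by line.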

oliverdrechsel avatar Jan 27 '22 07:01 oliverdrechsel