rnaseq icon indicating copy to clipboard operation
rnaseq copied to clipboard

Use eval output for tool versions

Open bentsherman opened this issue 2 years ago • 18 comments

This PR uses the experimental cmd output type in https://github.com/nextflow-io/nextflow/pull/4493 to simplify the collection of tool versions.

Once the topic channel support is merged into Nextflow, we can merge this PR with #1109 to simplify things further. Instead of emitting versions1, versions2, etc for processes with multiple tools, we can simply send them all to the 'versions' topic.

PR checklist

  • [ ] This comment contains a description of changes (with reason).
  • [ ] Make sure your code lints (nf-core lint).
  • [ ] Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • [ ] Usage Documentation in docs/usage.md is updated.
  • [ ] Output Documentation in docs/output.md is updated.
  • [ ] CHANGELOG.md is updated.
  • [ ] README.md is updated (including new tool citations and authors/contributors).

bentsherman avatar Nov 15 '23 22:11 bentsherman

For some reason the SALMON_QUANT process is failing:

Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:SALMON_QUANT (RAP1_UNINDUCED_REP2)` terminated with an error exit status (1)

Command executed:

  salmon quant \
      --geneMap genome_gfp.gtf \
      --threads 2 \
      --libType=SR \
      --index salmon \
      -r RAP1_UNINDUCED_REP2_primary.fastq.gz \
       \
      -o RAP1_UNINDUCED_REP2
  
  if [ -f RAP1_UNINDUCED_REP2/aux_info/meta_info.json ]; then
      cp RAP1_UNINDUCED_REP2/aux_info/meta_info.json "RAP1_UNINDUCED_REP2_meta_info.json"
  fi

Command exit status:
  1

Command output:
  (empty)

Command error:
Version Info: This is the most recent version of salmon.
### salmon (selective-alignment-based) v1.10.1
### [ program ] => salmon 
### [ command ] => quant 
### [ geneMap ] => { genome_gfp.gtf }
### [ threads ] => { 2 }
### [ libType ] => { SR }
### [ index ] => { salmon }
### [ unmatedReads ] => { RAP1_UNINDUCED_REP2_primary.fastq.gz }
### [ output ] => { RAP1_UNINDUCED_REP2 }
Logs will be written to RAP1_UNINDUCED_REP2/logs
[2023-11-15 06:41:55.947] [jointLog] [info] setting maxHashResizeThreads to 2
-----------------------------------------
| Loading contig table | Time = 63.579 us
-----------------------------------------
size = 840
[2023-11-15 06:41:55.947] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2023-11-15 06:41:55.947] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2023-11-15 06:41:55.947] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2023-11-15 06:41:55.947] [jointLog] [info] parsing read library format
[2023-11-15 06:41:55.947] [jointLog] [info] There is 1 library.
[2023-11-15 06:41:55.947] [jointLog] [info] Loading pufferfish index
[2023-11-15 06:41:55.947] [jointLog] [info] Loading dense pufferfish index.
-----------------------------------------
| Loading contig offsets | Time = 1.6262 ms
-----------------------------------------
-----------------------------------------
| Loading reference lengths | Time = 3.809 us
-----------------------------------------
-----------------------------------------
| Loading mphf table | Time = 77.721 us
-----------------------------------------
size = 247935
Number of ones: 839
Number of ones per inventory item: 512
Inventory entries filled: 2
-----------------------------------------
| Loading contig boundaries | Time = 478.93 us
-----------------------------------------
size = 247935
-----------------------------------------
| Loading sequence | Time = 95.192 us
-----------------------------------------
size = 222765
-----------------------------------------
| Loading positions | Time = 346.52 us
-----------------------------------------
size = 381211
-----------------------------------------
| Loading reference sequence | Time = 86.075 us
-----------------------------------------
-----------------------------------------
| Loading reference accumulative lengths | Time = 2.828 us
-----------------------------------------
[2023-11-15 06:41:55.950] [jointLog] [info] done
[2023-11-15 06:41:56.000] [jointLog] [info] Index contained 126 targets
[2023-11-15 06:41:56.000] [jointLog] [info] Number of decoys : 1
[2023-11-15 06:41:56.000] [jointLog] [info] First decoy index : 125 

Error: no valid ID found for GFF record

Maybe dev is broken?

bentsherman avatar Nov 15 '23 22:11 bentsherman

Fetching upstream fixed it 👍

bentsherman avatar Nov 15 '23 22:11 bentsherman

nf-core lint overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit e68f451

+| ✅ 144 tests passed       |+
#| ❔   6 tests were ignored |#
!| ❗   5 tests had warnings |!

:heavy_exclamation_mark: Test warnings:

  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • nextflow_config - Config manifest.version should end in dev: 3.13.0
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in WorkflowRnaseq.groovy: Optionally add in-text citation tools to this list.

:grey_question: Tests ignored:

  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
  • multiqc_config - multiqc_config

:white_check_mark: Tests passed:

Run details

  • nf-core/tools version 2.10
  • Run at 2023-11-16 01:26:58

github-actions[bot] avatar Nov 15 '23 22:11 github-actions[bot]

@drpatelh @ewels now that Nextflow has channel topics, it occurred to me that we could actually simplify a lot by just using env outputs. See my comment here, but I will copy the example code to illustrate my point:

// current nf-core convention
process FOOBAR {
    output:
    path 'versions.yml', topic: versions

    """
    # ...

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        foo: \$(foo --version)
        bar: \$(bar --version)
    END_VERSIONS
    """
}

// env output
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), env(FOO_VERSION), topic: versions
    tuple val("${task.process}"), val('bar'), env(BAR_VERSION), topic: versions

    """
    # ...

    FOO_VERSION=\$(foo --version)
    BAR_VERSION=\$(bar --version)
    """
}

// cmd output
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd('foo --version'), topic: versions
    tuple val("${task.process}"), val('bar'), cmd('bar --version'), topic: versions

    """
    # ...
    """
}

I would love to hear what you guys think about (2) vs (3). Keep in mind that in all three cases, the tool version commands are executed in the task script in more or less the same way.

bentsherman avatar Nov 27 '23 22:11 bentsherman

My preference is for option 3 - the new cmd style. I like keeping the version commands out of the script, as it makes the script commands much cleaner and easier to read.

NB: The versions1/versions2 stuff in the PR code diff can be simplified after https://github.com/nf-core/rnaseq/pull/1109 is merged. This new syntax is shown in Ben's comment.

ewels avatar Dec 11 '23 14:12 ewels

My preference goes to version2, which I find more explicit, easier to read, but I do love the version3 that removes completely the version generation from the script itself.

maxulysse avatar Dec 11 '23 15:12 maxulysse

I like three, my only concern is some of the commands to get the version get pretty long. In theory, we could do something like:

def foo_version = 'foo --version'

output:
    tuple val("${task.process}"), val('foo'), cmd("${foo_version}"), topic: versions
    

edmundmiller avatar Dec 11 '23 15:12 edmundmiller

I like option 3! But I wonder about modules with R or python scripts, where we use those languages to create the versions.yml instead of bash. Will this work? or do we have to continue using the old syntax for these cases?

mirpedrol avatar Dec 11 '23 15:12 mirpedrol

I would really like the inputs/ outputs section to remain as concise as possible, and I like the separation of concerns where the command to produce the output happens in more or less the same place. I'd do a bit of a WTF if people suddenly started embedding extensive process stuff where I expect the I/O.

So I have a fairly strong dislike for option 3), I think some fairly horrific stuff could happen there and make the processes hard to understand.

So option 2) for me please!

pinin4fjords avatar Dec 11 '23 15:12 pinin4fjords

Agreeing with @pinin4fjords there, version3 looks beautiful as long as all works well, when it starts to bug, it's a mess to debug.

maxulysse avatar Dec 11 '23 15:12 maxulysse

@pinin4fjords - note that one of the limitations of cmd (which will be documented) is that it doesn't support newlines.

That will hopefully prevent people from doing anything too horrendous 😆

We could have an nf-core modules linting rule that checks the string length and fails if it's too long, suggesting that people use env in that particular case instead.

ewels avatar Dec 11 '23 15:12 ewels

@pinin4fjords - note that one of the limitations of cmd (which will be documented) is that it doesn't support newlines.

That will hopefully prevent people from doing anything too horrendous 😆

There's plenty of evil to be done with pipes!

pinin4fjords avatar Dec 11 '23 15:12 pinin4fjords

But I wonder about modules with R or python scripts, where we use those languages to create the versions.yml instead of bash. Will this work? or do we have to continue using the old syntax for these cases?

@mirpedrol - No it won't work. Suggestion would be to use env in these cases as in option 2 (no need for the old syntax with the cat <<-END_VERSIONS stuff). But there are relatively few of these non-bash modules, none in the rnaseq for example I think.

ewels avatar Dec 11 '23 15:12 ewels

rnaseq has a couple of R modules actually, they're just not obvious because they're local- and we will hopefully fix that at some point, and they will then need templates etc.

pinin4fjords avatar Dec 11 '23 16:12 pinin4fjords

Thank you all for your feedback. I still prefer env myself, but Paolo is determined now to add the cmd type, so we will have both and you can use whichever one you prefer.

My preference goes to version2, which I find more explicit, easier to read, but I do love the version3 that removes completely the version generation from the script itself.

Note that the cmd type is still executed in the task script just like an env, it just inserts the command for you

I like three, my only concern is some of the commands to get the version get pretty long.

@Emiller88 I don't think you can reference local variables in an output as in your example, but you could reference a global variable, for example:

foo_version = 'really | long | version | command'

process foo {
  output:
  cmd("${foo_version}")
}

But I wonder about modules with R or python scripts, where we use those languages to create the versions.yml instead of bash. Will this work? or do we have to continue using the old syntax for these cases?

@mirpedrol In this PR I changed all the processes to only emit the metadata and then the YAML is constructed at the end of the pipeline. If you usually generate the tool version from within a Python or R script, the cmd output could do something like python script.py --version to retrieve the version from Bash. If the process script itself is not Bash, however, then the cmd output won't work. So whenever cmd isn't supported or would be unwieldy to use, you can always fallback to an env

note that one of the limitations of cmd (which will be documented) is that it doesn't support newlines

You could have a multi-line command by using semi-colons for newlines 😅

Regarding multi-line outputs, we found a way to support them for both env and cmd. So whereas currently env outputs are squashed to a single line, both will support multi-line output going forward.

bentsherman avatar Dec 11 '23 23:12 bentsherman

So I think everyone agrees that options 2 + 3 are both improvements ✅

For any processes with script blocks written in languages other than bash, we will have to use the env approach. For bash commands I see now three options, which maybe we can vote on in the nf-core Slack:

Option 1: env

// env output
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), env(FOO_VERSION), topic: versions
    tuple val("${task.process}"), val('bar'), env(BAR_VERSION), topic: versions

    """
    # ...

    FOO_VERSION=\$(foo --version)
    BAR_VERSION=\$(bar --version)
    """
}

Option 2: cmd

// cmd output
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd('foo --version'), topic: versions
    tuple val("${task.process}"), val('bar'), cmd('bar --version'), topic: versions

    """
    # ...
    """
}

Option 3: cmd + variable

// cmd + variable output
foo_version = 'foo --version'
bar_version = 'bar --version'
process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd(foo_version), topic: versions
    tuple val("${task.process}"), val('bar'), cmd(bar_version), topic: versions

    """
    # ...
    """
}

ewels avatar Dec 12 '23 09:12 ewels

I think the real thing this could open up is parsing the version string in groovy as another option

// cmd + variable output
foo_version = getVersionFromString('foo --version')
bar_version = 'bar --version'

process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd(foo_version), topic: versions
}

in a lib far far away:

def getVersionFromString(String text) {
    def matcher = text =~ /v(\d+\.\d+\.\d+)/
    return matcher ? matcher[0][1] : null
}

Just a thought.

edmundmiller avatar Dec 12 '23 21:12 edmundmiller

The thing is that the command must be executed in the task environment, because Nextflow might not have access to the tool from outside the task.

You could just emit the raw output of the tool version command, remove the duplicates, and then parse the string in Groovy:

process FOOBAR {
    output:
    tuple val("${task.process}"), val('foo'), cmd('foo --version'), topic: versions
}

Channel.topic('versions') .map { process, tool, raw_version ->
    [ process, tool, getVersionFromString(tool, raw_version) ]
}

That comes down to whether you would rather parse the version with a Bash one-liner or Groovy code. Note that you have to write a custom parser for every tool, so putting it all in a lib far far away would break the modularity of your modules. Unless you have a way to "register" a parser from the module script.

bentsherman avatar Dec 12 '23 21:12 bentsherman

Closing for now -- follow https://github.com/nf-core/fetchngs/pull/347 for latest updates

bentsherman avatar May 16 '25 02:05 bentsherman