bacannot Use of named outputs in processes

Hey Felipe,

Building upon the stylistic change suggested in https://github.com/fmalmeida/bacannot/issues/38

Perhaps instead of using the prokka.out[1] etc, we can rely upon named outputs


      annotations_files_ch = (PROKKA.out.renamed_genome).join(PROKKA.out.gffAnnotation)
                                       .join(MLST.out.mlstAnalysis)
                                       .join(BARRNAP.out.gffTuple)

This way you don't really need to rely upon sections like

        antismash_output = antismash.out[0]

... since you can directly call it antismash.out.result. This can lead to a significant reduction in the number of identifiers used in the main.nf script.

The use of camelCase for emitted channel names is a personal preference of mine, to distinguish between the identifiers coming from the user snake_case and the identifiers I've used within the pipeline.

Nov 16 '21 17:11 abhi18av

Hi @abhi18av,

Recently we have discussed about that here in the lab. And we agreed that we should not rely on sections like:

antismash_output = antismash.out[0]

Because, although it works, it is not clear enough which input/output is being used in the modules And I believe it would be nice to have these changes to it easier to debug and work on the code.

However, in the lab we discussed to save and use the modules outputs in the Groovy maps instead of directly through variables or named outputs. We were thinking in the beginning of the workflow to initialize a Groovy map to capture the results and save them with names that better describe the results so any one can look at the code and easily understand what is being used.

For example, we'd initialize a map:

def OUTPUTS = [:]

And them save the results (after produced) into these maps with descriptive names to be reused as input to other modules in the pipeline, for example:

OUTPUTS['prokkaGff']                         = prokka.out[1]

Perharps, we could have both? To have things as clear as possible in the code, and to make it easier to reuse the results? For example:

OUTPUTS['prokkaGff'] = prokka.out[1]
OUTPUTS['prokkaRenamedGenome'] = prokka.out[3]

...

annotations_files_ch = (OUTPUTS['prokkaRenamedGenome']).join(OUTPUTS['prokkaGff'] )
                                       .join(OUTPUTS['mlstAnalysis'])
                                       .join(OUTPUTS['barnappGff'])

What do you think? Would it be too much? Only the named channels already solve the problem?

Nov 17 '21 17:11 fmalmeida

Hmm, this is an interesting idea!

Putting channel data in a hash-map - I'll do some experiments tomorrow regarding this. My gut feeling is that this might not work, since we're essentially mixing the data structure which is not thread safe Hash map with a data structure channel which is designed for parallelism. But definitely liked the suggestion overall I'm curious to investigate this 🕵️

Beyond this, I think that even with the usage of OUTPUT, we are not resolving the problem of reducing the number of identifiers in the global namespace OUTPUT['abc'] would still be an identifier in the global namespace, no?

x-ref: https://www.baeldung.com/java-concurrent-map

Nov 17 '21 19:11 abhi18av

Re: Hash-map

Let's see what happens, if it does not work, it is ok. Was just an ideia we had in the lab :)

Re: number of identifiers

The number of identifiers and sections where I create them exists just because of one single problem I had:

The pipeline has optional modules. For instance a user may or may not execute the antismash module.

However, whenever I skipped a module, when I arrived at the reporting and jbrowse modules where I grab all the annotation results for each sample with .join, nextflow complained that antismash.out[1] didn't exist, even though using reminder: true.

So to solve that, I relied upon the identifiers to load in it the results when the module was executed and an empty channel when it was skipped.

I don't like it either to have it a bunch of identifiers. For me the best would be to use only the named outputs.

However, I am not an expert in Groovy, I learnt everything from nextflow's manual 😅 and I didn't know how to solve this problem that I faced when skipping modules more elegantly.

But if you know how to diminish the number of identifiers relying only in the named outputs and avoiding this error when skipping module. I would be super happy to use it how you first described when opened the issue 😁

Nov 17 '21 22:11 fmalmeida

In the small sample below

nextflow.enable.dsl = 2


process SAY_HELLO {
    tag "${x}"

    input:
        val(x)

    output:
        path("*txt")

    script:
        """
        sleep \$[ ( \$RANDOM % 10 )  + 1 ]s
        echo $x > ${x}.txt
        """

}

process SAY_BYE {
    tag "${x}"

    input:
        path(x)

    output:
        path("*txt")

    script:
        """
        sleep \$[ ( \$RANDOM % 10 )  + 1 ]s
        cat $x > temp.txt
        """

}


workflow {

    cheers_ch = Channel.of('Hello', 'Hola', 'Ola', 'Nihao', 'Namaste')

    SAY_HELLO(cheers_ch)

    // SAY_BYE(SAY_HELLO.out)

    def output_map = [:]
    output_map[0] = SAY_HELLO.out
    

}

Both SAY_BYE(SAY_HELLO.out) and SAY_BYE(output_map[0]) versions seem to work!!

Though, I do think that before the wider scale consumption, it might be worth discussing on the nf-core slack or the Nextflow discussions https://github.com/nextflow-io/nextflow/discussions to have opinion from people who are more experienced.

Nov 18 '21 18:11 abhi18av

The pipeline has optional modules. For instance a user may or may not execute the antismash module.

However, whenever I skipped a module, when I arrived at the reporting and jbrowse modules where I grab all the annotation results for each sample with .join, nextflow complained that antismash.out[1] didn't exist, even though using reminder: true.

So to solve that, I relied upon the identifiers to load in it the results when the module was executed and an empty channel when it was skipped.

Regarding this, okay let's dig deeper. I see that you've followed the general practice of conditional value allocation in batch_workflow


      if (params.skip_antismash == false) {
        antismash(prokka.out[2])
        antismash_output = antismash.out[0]
      } else {
        antismash_output = Channel.empty()
      }

And then this is used later

      // Grab inputs needed for JBrowse step
      jbrowse_input = merge_annotations.out[0].join(annotations_files, remainder: true)
                                              ...
                                              .join(antismash_output,  remainder: true)

But when we see the jbrowse process and the actual files which which have been used in the script section are way less than what has been specified in the input directive.

For example, file(amrfinder), file(vfdb) etc, but they are not used directly within the script section. I understand, these files are used by the tools as indexes / references etc but do they need to be named?

We might simplify the input directive as shown below

tuple val(prefix), file(gff), file(draft), file("prokka_gff") ... path("*")

Using the path("*") we simply stage in the remaining files in an anonymous manner.

However, I am not an expert in Groovy, I learnt everything from nextflow's manual 😅 and I didn't know how to solve this problem that I faced when skipping modules more elegantly.

No worries about this @fmalmeida, I'm pleasantly surprised how much you've absorbed from the manual itself. I myself ended up printing the entire thing to learn - and after all, we're always learning 😊

Nov 18 '21 18:11 abhi18av

Hi @abhi18av,

Re: about the hash maps

Though, I do think that before the wider scale consumption, it might be worth discussing on the nf-core slack or the Nextflow discussions https://github.com/nextflow-io/nextflow/discussions to have opinion from people who are more experienced.

I completely agree with you. It is better to first check if this strategy has drawbacks in parallelization to avoid slowing down the pipeline.

Re: about the optional modules

But when we see the jbrowse process and the actual files which which have been used in the script section are way less than what has been specified in the input directive.

That is really I concern I had. In the best world, it would be best to have created the input channels for each "visualization/reporting" module on demand, without using the annotations_files channel. Because this channel, has all the produced outputs, some of the modules need, others don't ... And to avoid creating multiple input channel identifiers throughout the workflow I chose to save all of them in a single channel so the input directive for post-annotation modules would be standardise (could use the same).

For example, file(amrfinder), file(vfdb) etc, but they are not used directly within the script section. I understand, these files are used by the tools as indexes / references etc but do they need to be named?

Here, I totally agree with you! In the best scenario I would like to have named only the files that I would really use in the script directive, because it is really painful to name them all assuring the proper order. But, I fell in the same pit, which is I did not know how to properly do it.

Using the path("*") we simply stage in the remaining files in an anonymous manner.

This is amazing. I did not know about it. How the anonymous staging actually work? Are the anonymous files indexed as input.1, input.2, etc? I believe the first step to use this path directive would be re-organize the order of inputs (so the used files comes first) in the input directive, because it can only be used in the end, right? Or is it possible to have the following?

tuple val(prefix), file(gff), path("*"), file(mlst), ...

Also, I believe that your repo fork is pointing to previous 2.x versions as batch_workflow.nf doesn't exist anymore.

No worries about this @fmalmeida, I'm pleasantly surprised how much you've absorbed from the manual itself. I myself ended up printing the entire thing to learn - and after all, we're always learning

Yes, we are 😄 thanks, by the way

Nov 18 '21 19:11 fmalmeida

Hi!

I have made a small comment in issue #38 that may have some relevance to this one.

Nov 23 '21 15:11 fmalmeida

... because it can only be used in the end, right? Or is it possible to have the following?
tuple val(prefix), file(gff), path("*"), file(mlst), ...

No, this way, Nextflow would not be able to identify, how exactly to limit the use of * - on the other hand a code like


process MY_PROCESS {

    input:
    tuple val(sampleName), path(bai), path(bam)
    path(ref_fasta)
    path("*")
...
}

can be used as shown below


MY_PROCESS(OTHER_PROCESS.out, 
           params.ref_fasta, 
           [params.ref_fasta_fai, params.ref_fasta_dict])

Nov 23 '21 16:11 abhi18av

Thanks for clarifying it,

I believe ti could work and make things easier to read inside the module. However, since I have made the modules "accept" and understand the same input channel in "standard" manner for all, this would require great efforts inside the bacannot.nf workflow to:

Reconfigure the modules to have named outputs
And reconfigure the workflow itself to dynamically create the input channel for each channel right?

I understood how it would change the module.nf and make it more readable, but I could not fully grasp how this would be implemented in the workflow (and the necessary changes to be made).

Perhaps the best would be to create a branch specific for this issue based on the current master v3.0 and them, keep using the -profile test to check its implementation, right?

Nov 23 '21 16:11 fmalmeida

Agreed, this isn't something that's going to change or impact the usage of this pipeline and we can work on this in parallel.

But it's best, if the develop is stable before we make these changes. See https://github.com/fmalmeida/bacannot/issues/38#issuecomment-976853237

Nov 23 '21 17:11 abhi18av

bacannot bacannot copied to clipboard

Use of named outputs in processes

bacannot
bacannot copied to clipboard