Add map input/output type for processes

Open multimeric opened this issue 3 years ago • 4 comments

New feature

I would like to be able to produce a single channel with multiple named values (i.e. a Map) from my processes, i.e. my_process.out.view() should return:

[a:abc, b:123, c:false]
[a:abc, b:123, c:false]
[a:def, b:456, c:true]

Currently we have a tuple qualifier, which creates an output channel whose items are tuples of other types. However, the tuple is unlabelled, so users have to extract values from the channel by position, which results in confusing code.

It is also possible to produce multiple output channels, each of which has its own name. However, these channels can't easily be combined into a single channel containing maps or tuples, because the merge operator has been deprecated, and in general joining channels by position is discouraged.

Usage scenario

This would be useful when a user is working with mostly map data in their channels, likely because they want each field to be labelled instead of unlabelled as in a tuple.

Suggested implementation

I would envisage a new map qualifier, which is used like this:

process hmmer_search {
    container "quay.io/biocontainers/hmmer:3.3.2--h1b792b2_1"
    input:
      path profile
      path database
    output:
      map [table: path('table.txt'), human_readable: path('match.txt')]
    script:
      """
      hmmsearch -o match.txt --tblout table.txt ${profile} ${database}
      """
}

multimeric avatar May 24 '21 00:05 multimeric

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 21 '21 02:10 stale[bot]

+1

christopher-hardy avatar Feb 08 '22 19:02 christopher-hardy

A quick workaround is to output a tuple and follow up with a map operator that converts each tuple to a map.
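
As a minimal sketch of that workaround (not an official recipe), the process from the original post could emit a plain tuple, and the workflow could label its elements afterwards. The `params.profile` and `params.database` names are hypothetical and stand in for however the inputs are actually supplied:

```nextflow
process hmmer_search {
    container "quay.io/biocontainers/hmmer:3.3.2--h1b792b2_1"

    input:
    path profile
    path database

    output:
    // Unlabelled tuple: the element order here must match the closure below
    tuple path('table.txt'), path('match.txt')

    script:
    """
    hmmsearch -o match.txt --tblout table.txt ${profile} ${database}
    """
}

workflow {
    // params.profile and params.database are assumed parameter names
    hmmer_search(file(params.profile), file(params.database))

    hmmer_search.out
        // Convert each positional tuple into a map with named fields
        .map { table, human_readable -> [table: table, human_readable: human_readable] }
        // Downstream code can now refer to fields by name instead of index
        .view { it.table }
}
```

The field names live in one closure right after the process call, so only that closure needs to change if the output order changes. The process itself still exposes a positional tuple, so this does not replace the requested map qualifier.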

bentsherman avatar Jul 13 '22 22:07 bentsherman

This issue is complementary to #2257, which is about multiple named channels whereas this issue is about named values within a channel (e.g. map). Ideally both use cases should be supported.

A single map channel would be used for 1-to-1 relationships whereas multiple named channels would be used for 1-to-many and many-to-many relationships.

bentsherman avatar Jul 18 '22 16:07 bentsherman

+1

Or maybe immutable named tuples? E.g. output: tuple table: path('table.txt'), human_readable: path('match.txt')

Then all current tuple-reliant functionality (e.g. groupKey) works as before, but one can use names instead of indices when manipulating process results, e.g.

input_ch | MY_PROC | map { it.table }

instead of

| map { it[0] }

notestaff avatar Dec 10 '22 18:12 notestaff