tools-iuc icon indicating copy to clipboard operation
tools-iuc copied to clipboard

Infer collection element naming with `default_identifier_source`

Open neoformit opened this issue 1 year ago • 6 comments

The issue

When collections are created as a tool output, the Galaxy tool author can declare output element names by referencing input datasets. However, this feature is poorly documented and seems to be used in only one IUC tool, macs2_callpeak. Its use is relevant for any tool that accepts paired inputs and produces paired outputs (i.e. many tools, including but not limited to IUC). Without this declaration, Galaxy will default to naming output collection elements after the first input dataset element. This results in "R2" outputs being named as "R1" which is obviously misleading and confusing for the user.

Part of this issue is that the default_identifier_source attribute doesn't seem to be documented properly in the Tool XML Schema page. In particular, the weird pipe syntax for referencing nested variables (e.g. conditional_name|param_name).


As described in issue https://github.com/galaxyproject/tools-iuc/issues/5703:

image

  1. The Trimmomatic job was submitted with paired reads in two collections: R1 and R2
  2. The job outputs four collections: Unpaired R1, Unpaired R2, Paired R1, Paired R2
  3. In the screenshot above the "Unpaired R2" collection is selected in the History
  4. Note the input files gah.fastqsanger (R1) and GSM461181_2.fastqsanger (R2)
  5. Now look at the first element in the History. It has been named after the first input dataset in the job (R1), and not after ${readtype.fastq_r2_in.name} (highlighted in Dataset info panel) as I would have expected.
  6. The result is that the XXX_R2.fastqsanger files are being named XXX_R1.fastqsanger which looks like a bug?

The above issue refers to Trimmomatic, which has been fixed by https://github.com/galaxyproject/tools-iuc/pull/5728.

Proposed solution

  • A repo-wide application of the default_identifier_source attribute on <data> tags, where appropriate, as has been applied to Trimmomatic:
    <data name="fastq_out_r1_paired" default_identifier_source="readtype|fastq_r1_in" label="${tool.name} on 
    ${readtype.fastq_r1_in.name} (R1 paired)" format_source="fastq_r1_in">
    
  • Proper documentation of the default_identifier_source attribute in the Tool XML Schema

neoformit avatar Jan 23 '24 23:01 neoformit

grep --include "*xml" "<data" tools tool_collections -r | egrep "forward|R1":

  • [ ] tool_collections/kraken2/kraken2/kraken2.xml
  • [ ] tools/cutadapt/cutadapt.xml https://github.com/galaxyproject/tools-iuc/pull/5895
  • [ ] tools/sickle/sickle.xml
  • [ ] tools/flash/flash.xml
  • [ ] tool_collections/galaxy_sequence_utils/fastq_paired_end_interlacer/fastq_paired_end_interlacer.xml
  • [ ] tools/prinseq/prinseq.xml
  • [ ] tools/novoplasty/novoplasty.xml
  • [ ] tools/krakentools/extract_kraken_reads.xml
  • [ ] tools/umi_tools/umi-tools_whitelist.xml
  • [ ] tools/rcorrector/rcorrector.xml
  • [ ] tools/pear/pear.xml
  • [ ] dada2 https://github.com/galaxyproject/tools-iuc/pull/5896
  • [ ] tools/stacks/stacks_clonefilter.xml
  • [ ] tools/adapter_removal/adapter_removal.xml
  • [ ] tools/bbtools/bbnorm.xml
  • [ ] tools/read_it_and_keep/read-it-and-keep.xml
  • [ ] tools/megan/sam2rma.xml

had a quick look at all of them.

bernt-matthias avatar Jan 24 '24 08:01 bernt-matthias

I've added this, and

weird pipe syntax for referencing nested variables

is documented, but the table doesn't render correctly. The text is "This only applies to collections that are mapped over a non-collection input and that have equivalent structures. If this references input elements in conditionals, this value should be qualified (e.g. cond|input instead of input if input is in a conditional with name="cond")".

This "weird" syntax is what you should always use because input can be anything under any conditional, and the last one wins.

mvdbeek avatar Jan 24 '24 09:01 mvdbeek

Also another note here, your collection elements (this applies to datasets as well) shouldn't have an extension (redundant with the datatype) or the r1 / r2 suffix (that only applies to collections). Element identifiers are supposed to represent samples, in which case this goes away. And the input should be list:paired in the first place, not separate lists.

mvdbeek avatar Jan 24 '24 09:01 mvdbeek

but the table doesn't render correctly

Right... the pipe is rendered as a table column. It would be nice if that could be fixed so the documentation makes sense.

Element identifiers are supposed to represent samples

Of course this makes sense, but users will often upload read files to collections with an R1/R2 suffix. Hence why this became an issue for Anna. So we should aim to make the software intuitive for this use case, rather than expecting them to understand this problem and re-name all their collection elements.

@bernt-matthias would these tools need to be manually tested, or just apply the default_element_identifier and push? It's not such a long list.

neoformit avatar Jan 25 '24 04:01 neoformit

would these tools need to be manually tested, or just apply the default_element_identifier and push? It's not such a long list.

I would maybe test a few of the important ones (also to get experience in doing it right).

I reordered the list (roughly) by usage on EU (using the counts of the most frequently used version).

bernt-matthias avatar Jan 25 '24 09:01 bernt-matthias

Ok I can take a look at that, thanks @bernt-matthias

neoformit avatar Jan 25 '24 19:01 neoformit