Infer collection element naming with `default_identifier_source`
The issue
When collections are created as a tool output, the Galaxy tool author can declare output element names by referencing input datasets. However, this feature is poorly documented and seems to be used in only one IUC tool, macs2_callpeak. Its use is relevant for any tool that accepts paired inputs and produces paired outputs (i.e. many tools, including but not limited to IUC). Without this declaration, Galaxy will default to naming output collection elements after the first input dataset element. This results in "R2" outputs being named as "R1" which is obviously misleading and confusing for the user.
Part of this issue is that the default_identifier_source attribute doesn't seem to be documented properly in the Tool XML Schema page. In particular, the weird pipe syntax for referencing nested variables (e.g. conditional_name|param_name).
As described in issue https://github.com/galaxyproject/tools-iuc/issues/5703:
- The Trimmomatic job was submitted with paired reads in two collections: R1 and R2
- The job outputs four collections: Unpaired R1, Unpaired R2, Paired R1, Paired R2
- In the screenshot above the "Unpaired R2" collection is selected in the History
- Note the input files
gah.fastqsanger(R1) andGSM461181_2.fastqsanger(R2) - Now look at the first element in the History. It has been named after the first input dataset in the job (R1), and not after
${readtype.fastq_r2_in.name}(highlighted in Dataset info panel) as I would have expected. - The result is that the
XXX_R2.fastqsangerfiles are being namedXXX_R1.fastqsangerwhich looks like a bug?
The above issue refers to Trimmomatic, which has been fixed by https://github.com/galaxyproject/tools-iuc/pull/5728.
Proposed solution
- A repo-wide application of the
default_identifier_sourceattribute on<data>tags, where appropriate, as has been applied to Trimmomatic:<data name="fastq_out_r1_paired" default_identifier_source="readtype|fastq_r1_in" label="${tool.name} on ${readtype.fastq_r1_in.name} (R1 paired)" format_source="fastq_r1_in"> - Proper documentation of the
default_identifier_sourceattribute in the Tool XML Schema
grep --include "*xml" "<data" tools tool_collections -r | egrep "forward|R1":
- [ ] tool_collections/kraken2/kraken2/kraken2.xml
- [ ] tools/cutadapt/cutadapt.xml https://github.com/galaxyproject/tools-iuc/pull/5895
- [ ] tools/sickle/sickle.xml
- [ ] tools/flash/flash.xml
- [ ] tool_collections/galaxy_sequence_utils/fastq_paired_end_interlacer/fastq_paired_end_interlacer.xml
- [ ] tools/prinseq/prinseq.xml
- [ ] tools/novoplasty/novoplasty.xml
- [ ] tools/krakentools/extract_kraken_reads.xml
- [ ] tools/umi_tools/umi-tools_whitelist.xml
- [ ] tools/rcorrector/rcorrector.xml
- [ ] tools/pear/pear.xml
- [ ] dada2 https://github.com/galaxyproject/tools-iuc/pull/5896
- [ ] tools/stacks/stacks_clonefilter.xml
- [ ] tools/adapter_removal/adapter_removal.xml
- [ ] tools/bbtools/bbnorm.xml
- [ ] tools/read_it_and_keep/read-it-and-keep.xml
- [ ] tools/megan/sam2rma.xml
had a quick look at all of them.
I've added this, and
weird pipe syntax for referencing nested variables
is documented, but the table doesn't render correctly. The text is "This only applies to collections that are mapped over a non-collection input and that have equivalent structures. If this references input elements in conditionals, this value should be qualified (e.g. cond|input instead of input if input is in a conditional with name="cond")".
This "weird" syntax is what you should always use because input can be anything under any conditional, and the last one wins.
Also another note here, your collection elements (this applies to datasets as well) shouldn't have an extension (redundant with the datatype) or the r1 / r2 suffix (that only applies to collections). Element identifiers are supposed to represent samples, in which case this goes away. And the input should be list:paired in the first place, not separate lists.
but the table doesn't render correctly
Right... the pipe is rendered as a table column. It would be nice if that could be fixed so the documentation makes sense.
Element identifiers are supposed to represent samples
Of course this makes sense, but users will often upload read files to collections with an R1/R2 suffix. Hence why this became an issue for Anna. So we should aim to make the software intuitive for this use case, rather than expecting them to understand this problem and re-name all their collection elements.
@bernt-matthias would these tools need to be manually tested, or just apply the default_element_identifier and push? It's not such a long list.
would these tools need to be manually tested, or just apply the default_element_identifier and push? It's not such a long list.
I would maybe test a few of the important ones (also to get experience in doing it right).
I reordered the list (roughly) by usage on EU (using the counts of the most frequently used version).
Ok I can take a look at that, thanks @bernt-matthias