[Proposal] Adding a grouping column to the sample sheet

Open TheodoreMarkulin opened this issue 3 years ago • 0 comments

Description of feature

What is being proposed?

Currently, the base sample sheet template uses four columns: sample, fastq_1, fastq_2, and single_end. I would like to propose adding a fifth column, called group, to the sample sheet.

What does this solve?

A single sample may be run in the same experiment multiple times under different conditions. The current remedy (in the validate_unique_samples function) is to append an incrementing _T# suffix to each sample name that appears more than once: https://github.com/nf-core/tools/blob/171127bd850040e76febd0945e6980b7afcaad69/nf_core/pipeline-template/bin/check_samplesheet.py#L128-L129

By adding a grouping column, identical samples belonging to the same group can be modified by appending the group name instead of a _T#.
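To illustrate the renaming idea, here is a minimal sketch. It is not the existing check_samplesheet.py code: the function name `append_group_to_duplicates`, the example rows, and the `_<group>` suffix format are all assumptions made for illustration.

```python
from collections import Counter


def append_group_to_duplicates(rows):
    """Return copies of the rows where repeated sample names get a _<group> suffix.

    Hypothetical helper: assumes every row has "sample" and "group" keys.
    """
    # Count how often each sample name occurs in the sheet.
    counts = Counter(row["sample"] for row in rows)
    renamed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if counts[row["sample"]] > 1:
            new_row["sample"] = f"{row['sample']}_{row['group']}"
        renamed.append(new_row)
    return renamed


# Made-up example rows; file names and group names are placeholders.
rows = [
    {"sample": "S1", "fastq_1": "s1_ctrl.fastq.gz", "group": "control"},
    {"sample": "S1", "fastq_1": "s1_trt.fastq.gz", "group": "treated"},
    {"sample": "S2", "fastq_1": "s2_ctrl.fastq.gz", "group": "control"},
]
renamed_rows = append_group_to_duplicates(rows)
print([row["sample"] for row in renamed_rows])
# ['S1_control', 'S1_treated', 'S2']
```

Note that this sketch only yields unique names when the repeated rows sit in different groups; repeated rows within one group would still need a further rule.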

The main reason I'm proposing a grouping column, though, is downstream analysis. Anyone who wants to integrate differential analysis into a copy of a pipeline currently has to devise their own way of feeding in grouping information, outside of the assets already provided to the pipeline. A group column would give such analysis a natural entry point. Furthermore, being able to group samples (group by group, then sample name) makes it easy to integrate other processes, such as FASTQ concatenation.
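As a rough sketch of the grouping described above, the following collects FASTQ pairs per (group, sample) key, which is the shape a concatenation step would need. The sample sheet contents, the function name `group_fastqs`, and the column layout with a trailing group column are assumptions for illustration, not part of the current template.

```python
import csv
import io
from collections import defaultdict

# Made-up sample sheet with the proposed group column.
SAMPLESHEET = """sample,fastq_1,fastq_2,group
S1,s1_L001_R1.fastq.gz,s1_L001_R2.fastq.gz,control
S1,s1_L002_R1.fastq.gz,s1_L002_R2.fastq.gz,control
S1,s1_trt_R1.fastq.gz,s1_trt_R2.fastq.gz,treated
"""


def group_fastqs(handle):
    """Map (group, sample) to the list of (fastq_1, fastq_2) pairs in the sheet."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        key = (row["group"], row["sample"])
        grouped[key].append((row["fastq_1"], row["fastq_2"]))
    return dict(grouped)


fastqs_by_group = group_fastqs(io.StringIO(SAMPLESHEET))
for (group, sample), pairs in fastqs_by_group.items():
    # Each entry lists the files that would be concatenated together.
    print(group, sample, len(pairs))
```

Here the two control rows for S1 end up under one key and could be concatenated, while the treated run stays separate.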

May 19 '22 15:05 TheodoreMarkulin