Issue in check_samplesheet.py that incorrectly merges samples in pipelines
Description of the bug
It has been observed in check_samplesheet.py that the _T* suffix has been added to samples only when the occurrence of samples is more than 1, leaving all single-library samples (names) are unmodified.
For example, for a sample sheet as below:
sample,single_end,fastq_1,fastq_2,strandedness
SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz,forward
SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz,forward
SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz,,forward
CONTROL_1,CONTROL_1_RUN1_1.fastq.gz,CONTROL_1_RUN1_2.fastq.gz,forward
CONTROL_2,CONTROL_2_RUN1_1.fastq.gz,CONTROL_2_RUN1_2.fastq.gz,forward
the check_samplesheet.py creates the output as:
sample,single_end,fastq_1,fastq_2,strandedness
SAMPLE_PE_T1,True,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz,forward
SAMPLE_PE_T2,True,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz,forward
SAMPLE_SE,False,SAMPLE_SE_RUN1_1.fastq.gz,,forward
CONTROL_1,False,CONTROL_1_RUN1_1.fastq.gz,CONTROL_1_RUN1_2.fastq.gz,forward
CONTROL_2,False,CONTROL_2_RUN1_1.fastq.gz,CONTROL_2_RUN1_2.fastq.gz,forward
and inside any pipeline, the output is then modified with meta_clone.id.split('_')[0..-2].join('_') which strip off the last part from the sample name, thus all single-library samples are then incorrectly merged by CAT_FASTQ
Reported by https://github.com/ojziff in https://github.com/nf-core/rnavar/issues/54
Command used and terminal output
$cat samplesheet.csv
sample,fastq_1,fastq_2,strandedness
SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz,forward
SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz,forward
SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz,,forward
CONTROL_1,test_rnaseq_control1_1.fastq.gz,test_rnaseq_control1_2.fastq.gz,reverse
CONTROL_2,test_rnaseq_control2_1.fastq.gz,test_rnaseq_control2_2.fastq.gz,reverse
$wget https://raw.githubusercontent.com/nf-core/tools/master/nf_core/pipeline-template/bin/check_samplesheet.py
$python check_samplesheet.py samplesheet.csv output.csv
$cat output.csv
sample,single_end,fastq_1,fastq_2,strandedness
SAMPLE_PE_T1,False,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz,forward
SAMPLE_PE_T2,False,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz,forward
SAMPLE_SE,True,SAMPLE_SE_RUN1_1.fastq.gz,,forward
CONTROL_1,False,test_rnaseq_control1_1.fastq.gz,test_rnaseq_control1_2.fastq.gz,reverse
CONTROL_2,False,test_rnaseq_control2_1.fastq.gz,test_rnaseq_control2_2.fastq.gz,reverse
When the pipeline is run (here rnavar for example), the control samples (CONTROL_1 and CONTROL_2) are merged and run as one sample (CONTROL) which is wrong. Refer https://github.com/nf-core/rnavar/issues/54 for more details.
System information
Python 3.8.8 nextflow [22.04.0
]
@praveenraj2018 can this issue be closed now?
I am closing the issue as it seems to be fixed by #1633. Feel free to reopen if needed :)