tools Issue in check_samplesheet.py that incorrectly merges samples in pipelines

Description of the bug

It has been observed in check_samplesheet.py that the _T* suffix has been added to samples only when the occurrence of samples is more than 1, leaving all single-library samples (names) are unmodified.

For example, for a sample sheet as below:

sample,single_end,fastq_1,fastq_2,strandedness
SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz,forward
SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz,forward
SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz,,forward
CONTROL_1,CONTROL_1_RUN1_1.fastq.gz,CONTROL_1_RUN1_2.fastq.gz,forward
CONTROL_2,CONTROL_2_RUN1_1.fastq.gz,CONTROL_2_RUN1_2.fastq.gz,forward

the check_samplesheet.py creates the output as:

sample,single_end,fastq_1,fastq_2,strandedness
SAMPLE_PE_T1,True,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz,forward
SAMPLE_PE_T2,True,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz,forward
SAMPLE_SE,False,SAMPLE_SE_RUN1_1.fastq.gz,,forward
CONTROL_1,False,CONTROL_1_RUN1_1.fastq.gz,CONTROL_1_RUN1_2.fastq.gz,forward
CONTROL_2,False,CONTROL_2_RUN1_1.fastq.gz,CONTROL_2_RUN1_2.fastq.gz,forward

and inside any pipeline, the output is then modified with meta_clone.id.split('_')[0..-2].join('_') which strip off the last part from the sample name, thus all single-library samples are then incorrectly merged by CAT_FASTQ

Reported by https://github.com/ojziff in https://github.com/nf-core/rnavar/issues/54

Command used and terminal output

$cat samplesheet.csv
sample,fastq_1,fastq_2,strandedness
SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz,forward
SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz,forward
SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz,,forward
CONTROL_1,test_rnaseq_control1_1.fastq.gz,test_rnaseq_control1_2.fastq.gz,reverse
CONTROL_2,test_rnaseq_control2_1.fastq.gz,test_rnaseq_control2_2.fastq.gz,reverse

$wget https://raw.githubusercontent.com/nf-core/tools/master/nf_core/pipeline-template/bin/check_samplesheet.py

$python check_samplesheet.py samplesheet.csv output.csv

$cat output.csv
sample,single_end,fastq_1,fastq_2,strandedness
SAMPLE_PE_T1,False,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz,forward
SAMPLE_PE_T2,False,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz,forward
SAMPLE_SE,True,SAMPLE_SE_RUN1_1.fastq.gz,,forward
CONTROL_1,False,test_rnaseq_control1_1.fastq.gz,test_rnaseq_control1_2.fastq.gz,reverse
CONTROL_2,False,test_rnaseq_control2_1.fastq.gz,test_rnaseq_control2_2.fastq.gz,reverse

When the pipeline is run (here rnavar for example), the control samples (CONTROL_1 and CONTROL_2) are merged and run as one sample (CONTROL) which is wrong. Refer https://github.com/nf-core/rnavar/issues/54 for more details.

System information

Python 3.8.8 nextflow [22.04.0

error_sample_merge ]

Jun 13 '22 12:06 praveenraj2018

@praveenraj2018 can this issue be closed now?

Aug 04 '22 08:08 mashehu

I am closing the issue as it seems to be fixed by #1633. Feel free to reopen if needed :)

Aug 24 '22 07:08 mirpedrol