tools sniff_format is inconsistent

Description of the bug

I'm getting a "no header" error on this CSV, using the default check_samplesheet.py script.

CI: https://github.com/nf-osi/viralintegration/runs/6267396130?check_suite_focus=true

Samplesheet: https://github.com/nf-core/test-datasets/blob/viralintegration/samplesheet/samplesheet.csv

https://github.com/nf-osi/viralintegration/blob/dev/bin/check_samplesheet.py

It works with this samplesheet: https://github.com/nf-core/test-datasets/blob/rnaseq/samplesheet/v3.4/samplesheet_test.csv

It also works if you remove either of the samples, but not if both are present.

Command used and terminal output

git clone https://github.com/nf-osi/viralintegration.git
cd viralintegration

wget https://github.com/nf-core/test-datasets/raw/viralintegration/samplesheet/samplesheet.csv
python3 bin/check_samplesheet.py samplesheet.csv valid.csv

wget https://github.com/nf-core/test-datasets/raw/rnaseq/samplesheet/v3.4/samplesheet_test.csv
python3 bin/check_samplesheet.py samplesheet_test.csv valid.csv

# OR in viralintegration
gh pr checkout 12
nextflow run . -profile test

System information

No response

May 03 '22 16:05 edmundmiller

I also can't find an example of a pipeline running the sniff_format function.

May 03 '22 16:05 edmundmiller

Dear all,

I've found a similar problem in a pipeline I wrote starting from the nf-core template. As far as I understand, the problem comes from this call in check_samplesheet.py:

if not sniffer.has_header(peek):

According to the csv.Sniffer.has_header documentation, "This method is a rough heuristic and may produce both false positives and negatives." This is the simplest example I can produce:

import io
import csv

test = """sample,fastq_1,fastq_2
200-1-5,1_ID2101_200-1-5-H9H05KWZ-H1_S1_L001_R1_001.fastq.gz,1_ID2101_200-1-5-H9H05KWZ-H1_S1_L001_R2_001.fastq.gz
201-1-9,1_ID2101_201-1-9-H9H05KWZ-A2_S1_L001_R1_001.fastq.gz,1_ID2101_201-1-9-H9H05KWZ-A2_S1_L001_R2_001.fastq.gz
202-1-10,1_ID2101_202-1-10-H9H05KWZ-B2_S1_L001_R1_001.fastq.gz,1_ID2101_202-1-10-H9H05KWZ-B2_S1_L001_R2_001.fastq.gz
203-1-12,1_ID2101_203-1-12-H9H05KWZ-C2_S1_L001_R1_001.fastq.gz,1_ID2101_203-1-12-H9H05KWZ-C2_S1_L001_R2_001.fastq.gz"""

# read data into array to test with different line combinations
handle = io.StringIO(test)
lines = handle.readlines()

sniffer = csv.Sniffer()

# with the whole file, no header is detected
sniffer.has_header("".join(lines))  # False

# however, with only the first three lines (header plus two samples) it is detected
sniffer.has_header("".join(lines[:3]))  # True

# adding a fourth line breaks the test
sniffer.has_header("".join(lines[:4]))  # False

# however, the fourth line on its own is not the problem
sniffer.has_header("".join(lines[:1] + lines[3:]))  # True

The Python documentation describes the heuristic behind this function. As far as I understand, renaming the sample names with numbers solves this problem.
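To make the heuristic a bit more concrete, here is a simplified paraphrase of the per-column consistency rule as I read it. It is not the actual CPython code, and the function name `surviving_columns` and the shortened file names are made up for illustration. A column only gets to "vote" for a header if every data row has the same type (numeric) or, for text, the same string length; otherwise it is dropped, and with no columns left the answer is "no header".

```python
def surviving_columns(rows):
    """Return {column index: type or text length} for columns whose rows all agree.

    Simplified sketch of the consistency rule described for
    csv.Sniffer.has_header; hypothetical helper, not the library code.
    """
    column_kinds = {}
    dropped = set()
    for row in rows:
        for col, value in enumerate(row):
            if col in dropped:
                continue
            try:
                complex(value)  # numeric values count as one "type"
                kind = complex
            except (ValueError, OverflowError):
                kind = len(value)  # text is characterised by its length
            if col not in column_kinds:
                column_kinds[col] = kind
            elif column_kinds[col] != kind:
                # inconsistent column: it no longer votes
                del column_kinds[col]
                dropped.add(col)
    return column_kinds


def make_row(sample):
    # assumed naming scheme: the file name embeds the sample name
    return [sample, f"{sample}_R1.fastq.gz"]


two_rows = [make_row("200-1-5"), make_row("201-1-9")]
three_rows = two_rows + [make_row("202-1-10")]

print(surviving_columns(two_rows))    # both columns still consistent
print(surviving_columns(three_rows))  # the longer sample name drops every column
```

In this sketch, "202-1-10" is one character longer than the other sample names, and because the file names embed the sample name, every column becomes inconsistent at once.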

I understand that this issue is unpredictable and occurs in very few cases, so I would rather not propose adopting a standard for the header format. However, would it be possible to add a parameter to check_samplesheet.py to skip this check when I'm sure the file is correct, or to provide the header through parameters? This would also mean modifying samplesheet_check.nf to accept additional parameters, for example using modules.config.
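As a rough idea of what "providing the header" could look like, the first row could be compared against the expected column names instead of asking csv.Sniffer. This is only a sketch: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the column list is assumed from my example above, not taken from the nf-core template.

```python
import csv
import io

# assumed column names; a real pipeline would pass its own
REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def missing_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the first row."""
    reader = csv.reader(handle)
    header = next(reader, [])  # empty file -> empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


with_header = io.StringIO(
    "sample,fastq_1,fastq_2\n"
    "200-1-5,a_R1.fastq.gz,a_R2.fastq.gz\n"
)
without_header = io.StringIO("200-1-5,a_R1.fastq.gz,a_R2.fastq.gz\n")

print(missing_columns(with_header))     # nothing missing
print(missing_columns(without_header))  # all three names reported
```

A check like this gives the same answer no matter how many rows the file has or how long the sample names are, which is the property the sniffing heuristic lacks.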

Oct 20 '22 15:10 bunop

solved in #2194

Mar 09 '23 14:03 mirpedrol