sniff_format is inconsistent.
Description of the bug
I'm getting a "no header" error on this csv? It's the default check samplesheet.
CI: https://github.com/nf-osi/viralintegration/runs/6267396130?check_suite_focus=true
Samplesheet: https://github.com/nf-core/test-datasets/blob/viralintegration/samplesheet/samplesheet.csv
https://github.com/nf-osi/viralintegration/blob/dev/bin/check_samplesheet.py
It works with this samplesheet https://github.com/nf-core/test-datasets/blob/rnaseq/samplesheet/v3.4/samplesheet_test.csv It also works if you remove either of the samples, but not if you have both.
Command used and terminal output
git clone https://github.com/nf-osi/viralintegration.git
cd viralintegration
wget https://github.com/nf-core/test-datasets/raw/viralintegration/samplesheet/samplesheet.csv
python3 bin/check_samplesheet.py samplesheet.csv valid.csv
wget https://github.com/nf-core/test-datasets/raw/rnaseq/samplesheet/v3.4/samplesheet_test.csv
python3 bin/check_samplesheet.py samplesheet_test.csv valid.csv
# OR in viralintegration
gh pr checkout 12
nextflow run . -profile test
System information
No response
I also can't find an example of a pipeline running the sniff_format function.
Dear all,
I've found a similar problem in a pipeline I wrote starting from the nf-core template. As far as I understand, the problem is raised by this call in check_samplesheet.py:
if not sniffer.has_header(peek):
which, according to the csv.Sniffer.has_header documentation, "is a rough heuristic and may produce both false positives and negatives". This is the simplest example I can produce:
import io
import csv
test = """sample,fastq_1,fastq_2
200-1-5,1_ID2101_200-1-5-H9H05KWZ-H1_S1_L001_R1_001.fastq.gz,1_ID2101_200-1-5-H9H05KWZ-H1_S1_L001_R2_001.fastq.gz
201-1-9,1_ID2101_201-1-9-H9H05KWZ-A2_S1_L001_R1_001.fastq.gz,1_ID2101_201-1-9-H9H05KWZ-A2_S1_L001_R2_001.fastq.gz
202-1-10,1_ID2101_202-1-10-H9H05KWZ-B2_S1_L001_R1_001.fastq.gz,1_ID2101_202-1-10-H9H05KWZ-B2_S1_L001_R2_001.fastq.gz\n203-1-12,1_ID2101_203-1-12-H9H05KWZ-C2_S1_L001_R1_001.fastq.gz,1_ID2101_203-1-12-H9H05KWZ-C2_S1_L001_R2_001.fastq.gz"""
# read the data into a list of lines to test different line combinations
handle = io.StringIO(test)
lines = handle.readlines()
sniffer = csv.Sniffer()
# with all the lines, no header is detected
sniffer.has_header("".join(lines))  # False
# however, the header is detected with only the first three lines
sniffer.has_header("".join(lines[:3]))  # True
# adding the fourth line breaks the detection
sniffer.has_header("".join(lines[:4]))  # False
# yet the fourth line on its own is not the problem
sniffer.has_header("".join(lines[:1] + lines[3:]))  # True
The Python documentation describes the heuristic behind this function. As far as I understand, renaming the samples with numeric names solves this problem.
I understand that this issue is unpredictable and occurs in very few cases, so I wouldn't propose enforcing a standard header format. However, would it be possible to add a parameter to check_samplesheet.py to skip this check when I'm sure the file is correct? Or to provide the header through parameters? This would also mean modifying samplesheet_check.nf to accept additional parameters, for example via modules.config.
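To make the proposal concrete, here is a minimal sketch of what I have in mind. It is not the code from check_samplesheet.py: the --skip-header-check and --header options and the function names are hypothetical, and the default column names are taken from my example above. Instead of relying on csv.Sniffer.has_header, it compares the first row with the expected column names, which is deterministic.

```python
import argparse
import csv
import io


def has_expected_header(text, expected_columns):
    """Return True if the first CSV row starts with the expected column names."""
    reader = csv.reader(io.StringIO(text))
    first_row = next(reader, [])  # empty input yields an empty row
    found = [cell.strip() for cell in first_row]
    return found[: len(expected_columns)] == list(expected_columns)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Validate a samplesheet header.")
    parser.add_argument("file_in")
    parser.add_argument("file_out")
    # Hypothetical options, not part of the nf-core template.
    parser.add_argument("--skip-header-check", action="store_true")
    parser.add_argument("--header", default="sample,fastq_1,fastq_2")
    return parser.parse_args(argv)


def check_header(text, args):
    """Apply the header check unless the user explicitly skipped it."""
    if args.skip_header_check:
        return True
    return has_expected_header(text, args.header.split(","))
```

With something like this, samplesheet_check.nf would only need to pass the extra options through to the script, and the result would no longer depend on the sample names.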
solved in #2194