nascent
v.2.2.0 fails on multiple replicates of the same sample
Description of the bug
Hey!
I was running the nascent pipeline v2.2.0 on two replicates of the same sample and encountered the following error at the FASTQC step:
Aug-14 10:33:55.296 [Actor Thread 74] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_NASCENT:NASCENT:FASTQC (1)'
Caused by:
Process `NFCORE_NASCENT:NASCENT:FASTQC` input file name collision -- There are multiple input files for each of the following file names: other.fq.gz
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
v2.1.1, by contrast, runs without this error. It might be related to #143.
Command used and terminal output
No response
Relevant files
No response
System information
No response
Hey! Could you share some steps to reproduce this error? Maybe a minimal samplesheet?
Are you using something like:
sample,fastq_1,fastq_2
other,https://raw.githubusercontent.com/nf-core/test-datasets/nascent/testdata/SRX882903_T1.fastq.gz,
other,https://raw.githubusercontent.com/nf-core/test-datasets/nascent/testdata/SRX882903_T2.fastq.gz,
Or whatever name you're using?
The command I run is similar to this:
nextflow -bg run nf-core/nascent -r 2.2.0 -profile cbe -work-dir $NXF_WRK -params-file params.json
My samplesheet looks like this:
sample,fastq_1,fastq_2
SAMPLE_REP1,path/to/reads.fq.gz,
SAMPLE_REP1,path/to_other/reads.fq.gz,
And my params.json looks like this:
{
  "input": ".\/samplesheet.csv",
  "outdir": ".\/outputdir",
  "assay_type": "GROseq",
  "fasta": "..\/data\/hg19.fa",
  "gtf": "..\/data\/gencode.v46lift37.basic.annotation.gtf",
  "bwa_index": "..\/data\/hg19.p13.plusMT.no_alt_analysis_set\/"
}
These input files work fine with pipeline v2.1.1.
I just tried running the dev version; it also fails with the same error.
> The command I run is similar to this:
Is there any way I could get the exact command, not just a similar one?
I only ask because the error message mentions other.fq.gz, which doesn't match the sample IDs in your samplesheet or the FASTQ file names in it.
The only thing I can think of is that both of your files are named reads.fq.gz, which is something I've seen throw an error in other pipelines. If so, I'd suggest renaming the files so the names differ, e.g. path/to/sample.fq.gz and path/to_other/sample_other.fq.gz for SAMPLE_REP1.
This causes issues because nf-validation checks that you didn't accidentally include the same file twice, although I'd expect that to fail sooner.
> Is there any way I could get the exact command, not just a similar one?
I replaced unnecessary details of my local paths while preserving the conceptual structure. I'm not sure if knowing the path to my work directory would help :)
> The only thing I can think of is that both of your files are named reads.fq.gz, which is something I've seen throw an error in other pipelines. If so, I'd suggest renaming the files so the names differ, e.g. path/to/sample.fq.gz and path/to_other/sample_other.fq.gz for SAMPLE_REP1.
My samplesheet is:
sample,fastq_1,fastq_2
ANDERSSON_REP1,sortmerna/SRR1596500/out/other.fq.gz,
ANDERSSON_REP1,sortmerna/SRR1596501/out/other.fq.gz,
The file paths are different; the only thing they share is the file name itself, other.fq.gz. This is the default output of sortmerna: reads that do not map to rRNAs are collected in other.fq.gz. The two replicates of the same sample are clearly in different directories.
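For anyone hitting the same collision, here is a minimal sketch of a pre-flight check that flags rows of one sample whose FASTQ files share a file name even though their directories differ. It assumes (based only on the error above) that rows with the same sample ID are staged into a single task; the function name `find_basename_collisions` is made up for illustration and is not part of the pipeline.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath


def find_basename_collisions(samplesheet_text):
    """Return {(sample, file_name): [paths]} for names used more than once per sample.

    Hypothetical helper, not part of nf-core/nascent. It only compares the final
    path component, because that is what appears in the collision error.
    """
    seen = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        for column in ("fastq_1", "fastq_2"):
            path = (row.get(column) or "").strip()
            if path:
                seen[(row["sample"], PurePosixPath(path).name)].append(path)
    return {key: paths for key, paths in seen.items() if len(paths) > 1}


# Samplesheet content copied from this thread.
SAMPLESHEET = """sample,fastq_1,fastq_2
ANDERSSON_REP1,sortmerna/SRR1596500/out/other.fq.gz,
ANDERSSON_REP1,sortmerna/SRR1596501/out/other.fq.gz,
"""

for (sample, name), paths in find_basename_collisions(SAMPLESHEET).items():
    print(f"{sample}: '{name}' is used by {len(paths)} files: {paths}")
```

Running this on a samplesheet before launching the pipeline would at least surface the clash early, with the offending paths listed.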
To try out your suggestion, I created a symlink for the second file, so the samplesheet becomes:
sample,fastq_1,fastq_2
ANDERSSON_REP1,sortmerna/SRR1596500/out/other.fq.gz,
ANDERSSON_REP1,sortmerna/SRR1596501/out/other_other.fq.gz,
Now both the directories and the names are different. I ran the dev version: FASTQC completes successfully, but the pipeline now fails on SAMTOOLS_INDEX:
ERROR ~ Error executing process > 'NFCORE_NASCENT:NASCENT:FASTQ_ALIGN_BWA:BAM_SORT_STATS_SAMTOOLS:SAMTOOLS_INDEX (ANDERSSON_REP1)'
Caused by:
Process `NFCORE_NASCENT:NASCENT:FASTQ_ALIGN_BWA:BAM_SORT_STATS_SAMTOOLS:SAMTOOLS_INDEX (ANDERSSON_REP1)` terminated with an error exit status (1)
Command executed:
samtools \
index \
-@ 1 \
\
ANDERSSON_REP1.sorted.bam
cat <<-END_VERSIONS > versions.yml
"NFCORE_NASCENT:NASCENT:FASTQ_ALIGN_BWA:BAM_SORT_STATS_SAMTOOLS:SAMTOOLS_INDEX":
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
INFO: Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
samtools index: failed to create index for "ANDERSSON_REP1.sorted.bam"
I tried to build the index manually (another node, another samtools version, another working folder), but it failed with the same error. At the same time, samtools head works fine.
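One thing that might be worth ruling out here (this is a guess, nothing in this thread confirms it) is a truncated or incompletely written BAM: reading the header can succeed on such a file while a full pass over it fails. The BAM specification defines a fixed 28-byte BGZF end-of-file block, and the sketch below simply checks whether the file ends with it. The function name `has_bgzf_eof` is hypothetical; `samtools quickcheck` performs a similar check from the command line.

```python
import os

# The 28-byte empty BGZF block that marks the end of a complete BAM file,
# as given in the SAM/BAM format specification.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF marker."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() < len(BGZF_EOF):
            return False
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

For example, `has_bgzf_eof("ANDERSSON_REP1.sorted.bam")` returning False would point to the file being cut short rather than to samtools index itself. A True result does not prove the file is intact; it only rules out this one failure mode.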
So, regarding replicates of the same sample, the issue is indeed the shared file name. This is unexpected behaviour: the files are in different folders, I created them in a standard way, and it seems intuitive for the same type of output to have the same file name. I don't think a user can be expected to know about this problem. Anyway, I can approve #161, but the UX is still not optimal.
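In case it helps others, the symlink workaround can be scripted. This is only a sketch of what I did by hand: it creates symlinks whose names are built from the last few directory names plus the file name, so that files like other.fq.gz from different runs no longer share a name. The helper names and the `depth` parameter are made up for illustration.

```python
import os
from pathlib import Path


def unique_link_name(path, depth=2):
    """Join the last `depth` parent directories and the file name with underscores."""
    path = Path(path)
    # Drop the filesystem root (if any) so it never ends up in the name.
    parts = [part for part in path.parts if part != path.anchor]
    return "_".join(parts[-(depth + 1):])


def link_with_unique_names(paths, link_dir, depth=2):
    """Create one symlink per input file in `link_dir` and return the link paths."""
    link_dir = Path(link_dir)
    link_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for path in paths:
        link = link_dir / unique_link_name(path, depth)
        if link.is_symlink() or link.exists():
            raise FileExistsError(f"{link} already exists; try a larger depth")
        os.symlink(Path(path).resolve(), link)
        links.append(link)
    return links
```

With the default depth, sortmerna/SRR1596500/out/other.fq.gz would be linked as SRR1596500_out_other.fq.gz, and the samplesheet would then point at the links instead of the original files.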
Regarding the samtools index problem, I am rerunning the pipeline with version 2.2.0.
Okay, samtools index works on 2.2.0 but not on dev.