Pipeline erroring out because of problem with salmon index
Description of the bug
Hi,
I have been having issues with nf-core/rnaseq erroring out at the salmon step. I've got a lengthy discussion over at the salmon github page (https://github.com/COMBINE-lab/salmon/issues/830), and it's worth noting here too.
I'm working with 15 samples, with ~5Gb total reads per sample (90,000,000 to 100,000,000 reads, ~75 bp reads). I've tried running these samples through the nf-core/rnaseq pipeline, but the pipeline took an age to run before dying at the salmon quant steps. Some samples finished in about 12 minutes. Others timed out after 8+ hours.
After the debugging described in the issue linked above, it seems most likely that the salmon index created by nf-core/rnaseq when alignment steps are skipped is misbehaving, causing salmon to run for 8+ hours with no mapping occurring. Manually running the exact same salmon command outside of nextflow with the same salmon singularity image, but with the pre-computed refgenie salmon index (refgenie pull hg38/salmon_sa_index), resulted in the mapping finishing in 11 minutes or so. I'm currently re-running nf-core/rnaseq specifying the refgenie salmon index with --salmon_index to see what happens.
There are other steps of the pipeline that seem to take unusually long too, such as NFCORE_RNASEQ:RNASEQ:CAT_FASTQ, which ran for 30+ minutes without completing even half the samples. All it's doing is combining the lanes with cat, right? Why does that take so long?
Finally, I'd like to query why some very large files are being duplicated. For example, the gentrome.fa file created by nf-core/rnaseq and needed by salmon appears twice in the work-dir:
$ find . -name "gentrome.fa" -exec ls -ltrh {} \;
-rw-rw-r-- 1 cfos cfos 3.4G Feb 22 11:49 ./work/a6/dbb86d0e4a92af341697e2c6163f28/gentrome.fa
-rw-rw-r-- 1 cfos cfos 3.4G Feb 22 12:12 ./work/dc/1a9f314d55dbf332d8113ea557f807/gentrome.fa
Thanks for your help.
Command used and terminal output
nextflow run nf-core/rnaseq --max_memory 55.GB --fasta /data/reference_genomes/GRCh38/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz --gtf /data/reference_genomes/GRCh38/Homo_sapiens.GRCh38.106.gtf.gz --skip_alignment --pseudo_aligner salmon --seq_center 'Ramaciotti Centre for Genomics' --input samplesheet.csv --outdir nf-core_results --save_merged_fastq true --skip_markduplicates true --extra_salmon_quant_args '--seqBias --gcBias --posBias' -profile singularity
Relevant files
System information
Nextflow version: 22.10.7
Hardware: desktop with local executor
Container engine: singularity
Version of nf-core/rnaseq: v3.10.1-g6e1e448
Desktop and OS details:

Hi @charlesfoster @rob-p ! Did you manage to find a solution to this? Is there something we can fix in the pipeline?
$ find . -name "gentrome.fa" -exec ls -ltrh {} \;
-rw-rw-r-- 1 cfos cfos 3.4G Feb 22 11:49 ./work/a6/dbb86d0e4a92af341697e2c6163f28/gentrome.fa
-rw-rw-r-- 1 cfos cfos 3.4G Feb 22 12:12 ./work/dc/1a9f314d55dbf332d8113ea557f807/gentrome.fa
From the .nextflow.log you uploaded, it looks like the SALMON_INDEX process failed initially and was automatically re-submitted by the pipeline. That is why the file appears twice, once in the work directory of each attempt:
Feb.-22 12:12:24.105 [Task monitor] INFO nextflow.processor.TaskProcessor - [a6/dbb86d] NOTE: Process `NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:SALMON_INDEX (genome.transcripts.fa)` terminated with an error exit status (139) -- Execution is retried (1)
Feb.-22 12:12:24.110 [Task submitter] DEBUG n.executor.local.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Feb.-22 12:12:24.111 [Task submitter] INFO nextflow.Session - [dc/1a9f31] Re-submitted process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:SALMON_INDEX (genome.transcripts.fa)
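For anyone checking their own run for the same pattern: exit status 139 follows the usual shell convention of 128 + signal number, so it corresponds to signal 11 (SIGSEGV, a segmentation fault). Below is a minimal shell sketch for listing retried tasks in a .nextflow.log and decoding their exit status. The function names are hypothetical (they are not part of Nextflow or nf-core/rnaseq), and the pattern assumes the log wording shown in the excerpt above.

```shell
# Hypothetical helpers, not part of Nextflow or nf-core/rnaseq.

# Print "<task hash> <exit status> <process name>" for every retried task.
# Assumes the "terminated with an error exit status (N) -- Execution is retried"
# wording seen in the log excerpt above.
list_retried_tasks() {
    grep -E 'terminated with an error exit status \([0-9]+\) -- Execution is retried' "$1" |
        sed -E 's/.*\[([0-9a-f]+\/[0-9a-f]+)\] NOTE: Process `([^`]+)` terminated with an error exit status \(([0-9]+)\).*/\1 \3 \2/'
}

# Shells report "killed by signal N" as exit status 128 + N.
describe_exit_status() {
    if [ "$1" -gt 128 ]; then
        sig=$(( $1 - 128 ))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with code $1"
    fi
}
```

Usage would be along the lines of `list_retried_tasks .nextflow.log` followed by `describe_exit_status 139` for any status it reports.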
I have the same issue that led me here: a timeout at the NFCORE_RNASEQ:RNASEQ:QUANTIFY_SALMON:SALMON_QUANT step. The last logs read:
[2023-04-27 07:37:49.110] [jointLog] [info] done
--
[2023-04-27 07:37:49.213] [jointLog] [info] Index contained 126 targets
[2023-04-27 07:37:49.213] [jointLog] [info] Number of decoys : 1
[2023-04-27 07:37:49.213] [jointLog] [info] First decoy index : 125
Using Salmon 1.9.0 and the test input data:
// Input data
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/samplesheet/v3.10/samplesheet_test.csv'
// Genome references
fasta = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genome.fasta'
gtf = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genes.gtf.gz'
gff = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genes.gff.gz'
transcript_fasta = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/transcriptome.fasta'
additional_fasta = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/gfp.fa.gz'
bbsplit_fasta_list = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/bbsplit_fasta_list.txt'
hisat2_index = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/hisat2.tar.gz'
salmon_index = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/salmon.tar.gz'
rsem_index = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/rsem.tar.gz'
Running on AWS batch
Anyone found a potential cause for this issue? Weird that the index works fine for some samples and not others. Unfortunately, I will need some way of reproducing on my end or a suggestion for a fix to deal with this in the pipeline.
Please upgrade to salmon >=1.10. It fixes a rare but persistent segmentation fault in the index construction of some references.
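A quick way to check whether a given salmon version meets that threshold is sketched below. Note that a plain string comparison gets this wrong, since "1.9.0" sorts after "1.10" alphabetically. The function name is hypothetical, and the sketch assumes a `sort` that supports `-V` (version sort, as in GNU coreutils).

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Assumes sort supports -V (version sort), as in GNU coreutils.
version_at_least() {
    # After version-sorting the pair, the smaller one comes first;
    # $1 >= $2 exactly when that smaller one is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (assumption: `salmon --version` prints "salmon <version>"):
#   have=$(salmon --version | awk '{print $2}')
#   version_at_least "$have" 1.10 || echo "salmon $have is older than 1.10"
```

With this, `version_at_least 1.9.0 1.10` fails while `version_at_least 1.10.0 1.10` succeeds.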
Thanks! Latest version of the pipeline (v3.11.2) uses a compatible version: https://github.com/nf-core/rnaseq/blob/5671b65af97fe78a2f9b4d05d850304918b1b86e/modules/nf-core/salmon/quant/main.nf#L5
It would be awesome if you could test this, @charlesfoster @jambler24, and let me know if you still have the same issue.
I think that since the error looks fixed from our end and we've had no response from the OP in a year we can consider this one sorted. Please reopen if that's not the case.