Pipeline erroring out because of problem with salmon index

Open charlesfoster opened this issue 2 years ago • 5 comments

Description of the bug

Hi,

I have been having issues with nf-core/rnaseq erroring out at the salmon step. I've got a lengthy discussion over at the salmon github page (https://github.com/COMBINE-lab/salmon/issues/830), and it's worth noting here too.

I'm working with 15 samples, with ~5Gb total reads per sample (90,000,000 to 100,000,000 reads, ~75 bp reads). I've tried running these samples through the nf-core/rnaseq pipeline, but the pipeline took an age to run before dying at the salmon quant steps. Some samples finished in about 12 minutes. Others timed out after 8+ hours.

After the debugging described in the issue linked above, it seems most likely that the salmon index created by nf-core/rnaseq when the alignment steps are skipped behaves unexpectedly, causing salmon to run for 8+ hours with no mapping occurring. Manually running the exact same salmon command outside of nextflow with the same salmon singularity image, but with the pre-computed refgenie salmon index (refgenie pull hg38/salmon_sa_index), resulted in the mapping finishing in about 11 minutes. I'm currently re-running nf-core/rnaseq, specifying the refgenie salmon index with --salmon_index, to see what happens.

There are other steps of the pipeline that seem to take unusually long too, such as NFCORE_RNASEQ:RNASEQ:CAT_FASTQ, which took 30+ minutes without completing even half the samples. All it's doing is combining the lanes with cat, right? Why does that take so long?

Finally, I'd like to query why some very large files are being duplicated. For example, the gentrome.fa file created by nf-core/rnaseq and needed by salmon appears twice in the work-dir:

$ find . -name "gentrome.fa" -exec ls -ltrh {} \;
-rw-rw-r-- 1 cfos cfos 3.4G Feb 22 11:49 ./work/a6/dbb86d0e4a92af341697e2c6163f28/gentrome.fa
-rw-rw-r-- 1 cfos cfos 3.4G Feb 22 12:12 ./work/dc/1a9f314d55dbf332d8113ea557f807/gentrome.fa

Thanks for your help.

Command used and terminal output

nextflow run nf-core/rnaseq \
    --max_memory 55.GB \
    --fasta /data/reference_genomes/GRCh38/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
    --gtf /data/reference_genomes/GRCh38/Homo_sapiens.GRCh38.106.gtf.gz \
    --skip_alignment \
    --pseudo_aligner salmon \
    --seq_center 'Ramaciotti Centre for Genomics' \
    --input samplesheet.csv \
    --outdir nf-core_results \
    --save_merged_fastq true \
    --skip_markduplicates true \
    --extra_salmon_quant_args '--seqBias --gcBias --posBias' \
    -profile singularity

Relevant files

nextflow_error.log

System information

Nextflow version: 22.10.7
Hardware: desktop with local executor
Container engine: singularity
Version of nf-core/rnaseq: v3.10.1-g6e1e448

Desktop and OS details: (screenshot)

Feb 23 '23 05:02 charlesfoster

Hi @charlesfoster @rob-p ! Did you manage to find a solution to this? Is there something we can fix in the pipeline?

$ find . -name "gentrome.fa" -exec ls -ltrh {} \;
-rw-rw-r-- 1 cfos cfos 3.4G Feb 22 11:49 ./work/a6/dbb86d0e4a92af341697e2c6163f28/gentrome.fa
-rw-rw-r-- 1 cfos cfos 3.4G Feb 22 12:12 ./work/dc/1a9f314d55dbf332d8113ea557f807/gentrome.fa

From the .nextflow.log you uploaded, it looks like the SALMON_INDEX process failed initially and was automatically re-submitted by the pipeline, which is why that file appears twice:

Feb.-22 12:12:24.105 [Task monitor] INFO  nextflow.processor.TaskProcessor - [a6/dbb86d] NOTE: Process `NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:SALMON_INDEX (genome.transcripts.fa)` terminated with an error exit status (139) -- Execution is retried (1)
Feb.-22 12:12:24.110 [Task submitter] DEBUG n.executor.local.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Feb.-22 12:12:24.111 [Task submitter] INFO  nextflow.Session - [dc/1a9f31] Re-submitted process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:SALMON_INDEX (genome.transcripts.fa)
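
For anyone wanting to check their own run for the same pattern, here is a minimal sketch in POSIX sh. It assumes the log lines look like the ones quoted above; the helper names `list_retried_tasks` and `describe_exit_status` are made up for illustration and are not part of Nextflow or the pipeline. An exit status above 128 conventionally means the process was killed by signal (status - 128), so 139 corresponds to signal 11, a segmentation fault.

```sh
# Hypothetical helper: print "<work-dir prefix> <exit status> <process name>"
# for every task that Nextflow reports as failed and retried.
# Assumes the "NOTE: Process `...` terminated with an error exit status (N)
# -- Execution is retried" line format shown in the log excerpt above.
list_retried_tasks() {
    log_file=${1:-.nextflow.log}
    sed -n 's|.*\[\([0-9a-f][0-9a-f]/[0-9a-f]\{6\}\)] NOTE: Process `\(.*\)` terminated with an error exit status (\([0-9][0-9]*\)).*Execution is retried.*|\1 \3 \2|p' "$log_file"
}

# Hypothetical helper: turn an exit status into a short description.
# Statuses above 128 are interpreted as "killed by signal (status - 128)".
describe_exit_status() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        name=$(kill -l "$sig" 2>/dev/null) || name="?"
        echo "killed by signal $sig (SIG$name)"
    else
        echo "exited with status $code"
    fi
}

# Example usage (not executed here):
#   list_retried_tasks .nextflow.log
#   describe_exit_status 139
```

The work-dir prefix in the first column (for example a6/dbb86d) identifies the directory under ./work that holds the leftovers of the failed attempt, including its copy of gentrome.fa.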

Mar 15 '23 12:03 drpatelh

I have the same issue that led me here: a timeout at the NFCORE_RNASEQ:RNASEQ:QUANTIFY_SALMON:SALMON_QUANT step. The last logs read:

[2023-04-27 07:37:49.110] [jointLog] [info] done
--
[2023-04-27 07:37:49.213] [jointLog] [info] Index contained 126 targets
[2023-04-27 07:37:49.213] [jointLog] [info] Number of decoys : 1
[2023-04-27 07:37:49.213] [jointLog] [info] First decoy index : 125

Using Salmon 1.9.0 and the test input data:

    // Input data
    input = 'https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/samplesheet/v3.10/samplesheet_test.csv'

    // Genome references
    fasta              = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genome.fasta'
    gtf                = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genes.gtf.gz'
    gff                = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genes.gff.gz'
    transcript_fasta   = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/transcriptome.fasta'
    additional_fasta   = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/gfp.fa.gz'

    bbsplit_fasta_list = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/bbsplit_fasta_list.txt'
    hisat2_index       = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/hisat2.tar.gz'
    salmon_index       = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/salmon.tar.gz'
    rsem_index         = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/rsem.tar.gz'

Running on AWS batch

Apr 27 '23 08:04 jambler24

Anyone found a potential cause for this issue? Weird that the index works fine for some samples and not others. Unfortunately, I will need some way of reproducing on my end or a suggestion for a fix to deal with this in the pipeline.

May 30 '23 11:05 drpatelh

Please upgrade to salmon >=1.10. It fixes a rare but persistent segmentation fault during index construction for some references.
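
A quick way to confirm which version is actually in use is sketched below in POSIX sh. This is only an illustration: the function names are hypothetical, it assumes `salmon --version` prints a line of the form `salmon 1.9.0`, and the comparison handles plain dotted numeric versions only. When the pipeline runs salmon inside a container, the check would need to be run inside that same container to be meaningful.

```sh
# Hypothetical helper: extract "1.9.0" from a line such as "salmon 1.9.0"
# (the assumed output format of `salmon --version`).
parse_salmon_version() {
    echo "${1##* }"
}

# Succeeds when dotted numeric version $1 is >= $2, e.g. version_ge 1.10.1 1.10.
# Missing components count as 0, so 1.10 and 1.10.0 compare as equal.
# Note: POSIX sh has no local variables, so vg_* names are global.
version_ge() {
    vg_a=$1
    vg_b=$2
    while [ -n "$vg_a" ] || [ -n "$vg_b" ]; do
        vg_x=${vg_a%%.*}
        vg_y=${vg_b%%.*}
        [ "${vg_x:-0}" -gt "${vg_y:-0}" ] && return 0
        [ "${vg_x:-0}" -lt "${vg_y:-0}" ] && return 1
        # Drop the leading component, or everything if no dot remains.
        case $vg_a in *.*) vg_a=${vg_a#*.} ;; *) vg_a= ;; esac
        case $vg_b in *.*) vg_b=${vg_b#*.} ;; *) vg_b= ;; esac
    done
    return 0
}

# Hypothetical wrapper: report whether the salmon on PATH meets the minimum.
check_salmon() {
    required=${1:-1.10}
    raw=$(salmon --version 2>/dev/null) || { echo "salmon not found" >&2; return 2; }
    found=$(parse_salmon_version "$raw")
    if version_ge "$found" "$required"; then
        echo "salmon $found is >= $required"
    else
        echo "salmon $found is older than $required" >&2
        return 1
    fi
}

# Example usage (not executed here):
#   check_salmon 1.10
```

A numeric comparison is used rather than a string comparison because "1.9.0" sorts after "1.10.0" as text, which would give the wrong answer for exactly the versions discussed here.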

May 30 '23 13:05 rob-p

Thanks! Latest version of the pipeline (v3.11.2) uses a compatible version: https://github.com/nf-core/rnaseq/blob/5671b65af97fe78a2f9b4d05d850304918b1b86e/modules/nf-core/salmon/quant/main.nf#L5

It would be awesome if you could test this, @charlesfoster @jambler24, and let me know if you still have the same issue.

May 30 '23 17:05 drpatelh

I think that since the error looks fixed from our end and we've had no response from the OP in a year we can consider this one sorted. Please reopen if that's not the case.

May 29 '24 10:05 pinin4fjords