
metaSPAdes stuck on read error correction

Open kevinxchan opened this issue 5 years ago • 26 comments

I've been running metaSPAdes on my dataset for about a week, but it appears to be stuck on the read error correction step of the pipeline and hasn't progressed since. I find this strange since I've run metaSPAdes on other datasets to completion, so perhaps it has something to do with my data (I can send it over if needed). From the log, it looks to be stuck on the file GS-Blade-bottom-all_merged.fq.gz, which makes me think there is a problem with this file. But this file was processed earlier in the SPAdes pipeline, so I'm not sure what's happening.

params.txt spades.log

kevinxchan avatar Aug 01 '18 16:08 kevinxchan

Thank you for your interest in SPAdes.

This is a known issue; we will try to fix it in SPAdes 3.12.1.

asl avatar Aug 01 '18 22:08 asl

I think I have run into this same issue: 10 samples out of 53 got hung up at this step with 3.12.0. The samples were pre-filtered for quality, adapters, poly-G, and human reads. Would reverting to an older version help, or is there a potential workaround? Thanks, Jordan

jbisanz avatar Aug 22 '18 23:08 jbisanz

The only "workaround" so far is to skip the read error correction step via --only-assembler. The results ought to be sub-optimal though and the assembly would require much more time & memory.

asl avatar Aug 23 '18 03:08 asl

I believe I've hit this same issue on three different assemblies. I was running 3.12.0 and just downloaded 3.13.0, and had the same issue. I would use the word "looping" instead of "stuck", since metaSPAdes is using all the CPU I gave it.
spades.log

Is this the same issue as above? And still open with the only workaround being --only-assembler?

Thank you, Lisa

louellette avatar Nov 19 '18 15:11 louellette

It's likely the same issue. We hoped to get it fixed, but most probably some corner cases still remain.

asl avatar Nov 25 '18 13:11 asl

Is this still an issue in the latest version of SPAdes (3.13.0)? I think I am running into similar issues in v3.12.0. SPAdes seems to stop at this step:

0:20:32.848 552M / 1G INFO General (main.cpp : 173) Subclustering Hamming graph

ammaraziz avatar Mar 05 '19 05:03 ammaraziz

@ammaraziz There is no problem in your case. Subclustering might be slow depending on your dataset. The apparent hang could also be caused by I/O problems on your end.

asl avatar Mar 05 '19 08:03 asl

Hi @asl, you are correct: there was an issue the first time around that killed the job at the subclustering stage, and because this stage takes a while (~10 hours for me), I assumed it had gotten stuck on the second run.

Thank you for the quick reply and thank you for Spades!

ammaraziz avatar Mar 05 '19 22:03 ammaraziz

I think I encountered the same issue here, using SPAdes 3.13.0. Why do some samples proceed just fine while others get stuck in the read correction process?

spades.log

metaganal avatar Mar 11 '19 05:03 metaganal

I am also using SPAdes 3.13.0 and ran into the same issue, where the pipeline stopped during error correction at "processing reads". Because many of my assemblies are very large, it is difficult to tell stuck jobs apart from ones that are simply taking a long time to assemble. It would be great to have a fix for this, and I'm happy to provide samples that get stuck for debugging purposes. One difference from the other users is that I am also specifying merged reads in addition to forward and reverse reads for my metagenome assembly. Thank you! Log file attached.

spades.log

brymerr921 avatar Apr 11 '19 17:04 brymerr921

Hello

This is a known problem with the CQF (counting quotient filter) library we're using; for some reason it just gets stuck. We hope to at least somehow work around this issue in the next SPAdes release.

asl avatar Apr 23 '19 20:04 asl

As I commented in #355, skipping pre-metaSPAdes quality filtering (I was using fastp) seemed to solve this issue for my problematic samples.

franciscozorrilla avatar Oct 10 '19 11:10 franciscozorrilla

Hi,

I am using SPAdes v3.13.1 to assemble Illumina short paired-end reads (2x101 bp). One of my libraries has 96 samples, all of which assembled quite nicely, except for one sample where SPAdes hung on the read error correction step; specifically, it seems to have a problem counting k-mers. Bypassing quality trimming did not change this behavior for this sample. I tried bypassing other steps before assembling, but with no success. I attached the params and log files.

I am wondering if I have used the correct settings for 2x101 or whether I am missing something. Thank you so much for this great tool and for your help in advance.

params.txt spades.log

ralsallaq avatar Oct 14 '19 21:10 ralsallaq

@ralsallaq This is a known problem with the CQF implementation. So far there is no workaround. You may want to try --only-assembler after quality trimming.

asl avatar Oct 14 '19 21:10 asl

Hi, I am having the same issue with SPAdes 3.15.0. @asl, can you elaborate on the expected effect of skipping the error correction step? I noticed that error correction with --meta on each of the R1 and R2 files separately (passed via -s) doesn't get stuck. Is it a good idea to use these error-corrected reads as input for SPAdes with --meta --only-assembler? Also, error correction without --meta doesn't get stuck with both R1 and R2 passed together. Are there fundamental differences in the error correction with and without --meta? And are there fundamental differences in error correction for single-end versus paired-end data with --meta? Thanks, Ilya.

ilyavs avatar Jan 24 '21 10:01 ilyavs

No work was done in 3.15 to address this issue due to lack of bandwidth. You may skip the read error correction; the running time and memory consumption might be larger, though.

asl avatar Feb 02 '21 10:02 asl

Skipping the error correction results in assembly differences compared to assembly with error correction (for samples where error correction worked). The run time is less of an issue, but the differences in assembly are a problem for our use case: they result in higher strain heterogeneity in CheckM.

ilyavs avatar Mar 16 '21 12:03 ilyavs

The differences in assembly results are certainly expected. Note that read error correction could (falsely) collapse the variation, as it might be hard to distinguish sequencing artefacts from strain variability in low-abundance strains.

asl avatar Mar 16 '21 12:03 asl

I understand the consequences of read error correction; we still need to use it. Is there any creative workaround, e.g. performing the error correction on a concatenated file of R1 and R2, or something like that?

ilyavs avatar Mar 17 '21 08:03 ilyavs

Hi all,

I am using the latest SPAdes (3.15.4) to assemble 12 datasets. It worked quite well for the first four, but then got stuck after processing the forward reads of the 5th dataset.

I found this issue, and wonder whether there is any update regarding this problem.

Thanks in advance.

Z-DAI avatar Feb 28 '22 11:02 Z-DAI

@Z-DAI So far there have been no changes. Use `--only-assembler`.

asl avatar Feb 28 '22 11:02 asl

Hi there. It appears that I am having the same issue that brymerr921 described. I am including forward, reverse, and merged reads with the --only-error-correction argument, and it's been stuck on the same metagenome for 4 days. Is the way to bypass this to run error correction on the forward and reverse reads, and then merge them afterwards? Or should I just run error correction on the merged reads and exclude the F/R reads? Thank you so much for your help.

hlfreund avatar Jul 27 '22 23:07 hlfreund

I had the same issue and worked around it like so:

  • Run metaSPAdes on the stuck dataset with --only-error-correction three times independently, each time passing either the R1, R2, or unpaired reads (if you have and use them) via the -s option (sketched at the top of the command block below).
  • R1 and R2 will no longer be paired with each other afterwards, so I consolidated them with the following commands (requires seqkit):
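# Sketch of the three independent error correction runs (the file names and
# output directories are placeholders; adjust threads/memory as needed):
spades.py --meta --only-error-correction -s Sample.R1.fastq.gz -o ec_R1
spades.py --meta --only-error-correction -s Sample.R2.fastq.gz -o ec_R2
spades.py --meta --only-error-correction -s Sample.unpaired.fastq.gz -o ec_unpaired
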
# Get a sorted list of R1 and R2 IDs without the @ and the extra info that the metaspades error correction appends:
zcat Sample.R1.fastq.00.0_0.cor.fastq.gz | awk 'NR%4==1 {print $1}' | cut -c 2- | sort > R1.IDs
zcat Sample.R2.fastq.00.0_0.cor.fastq.gz | awk 'NR%4==1 {print $1}' | cut -c 2- | sort > R2.IDs

# Get the IDs that are in both, R1 and R2 files:
comm -12 R1.IDs R2.IDs > paired.IDs

# Get the IDs that are only in R1:
comm -23 R1.IDs R2.IDs > only.R1

# Get the IDs that are only in R2:
comm -13 R1.IDs R2.IDs > only.R2

# Extract the reads with common IDs from R1 and R2:
seqkit grep -f paired.IDs Sample.R1.fastq.00.0_0.cor.fastq.gz > paired.R1.fastq
seqkit grep -f paired.IDs Sample.R2.fastq.00.0_0.cor.fastq.gz > paired.R2.fastq

# Extract the reads unique to R1:
seqkit grep -f only.R1 Sample.R1.fastq.00.0_0.cor.fastq.gz > unpaired.fastq

# Extract the reads unique to R2 and add to the previous file:
seqkit grep -f only.R2 Sample.R2.fastq.00.0_0.cor.fastq.gz >> unpaired.fastq

# Add the original unpaired reads to the file:
zcat Sample.unpaired.fastq.00.0_0.cor.fastq.gz >> unpaired.fastq

Having done this on some datasets that did not get stuck, I saw that the outcome is not identical to a proper run with error correction and assembly in one step. The assemblies from files prepared as described above contain fewer scaffolds, fewer total bp, and fewer CDSs > 300 bp (predicted with Prodigal).

nikolasbasler avatar Jul 28 '22 08:07 nikolasbasler

Somewhat in line with https://github.com/ablab/spades/issues/152#issuecomment-766323916, I was wondering, @asl: is the error correction step deterministic? If so, can it be performed on the individual reads of a pair separately? And if so, can the error-corrected reads then be used as input together with --only-assembler, yielding the same results as a "regular" run of SPAdes?

This would allow splitting the whole process into subprocesses that could then be run on individual nodes; i.e., in an HPC setting, the error correction could occur on nodes with less RAM, reserving the nodes with more RAM for the assembly step.

From my empirical observations, the error correction step is largely CPU-bound, while for things related to the Hamming graph, I/O seems to be an important factor too. I am currently running a test with --tmp-dir set to /dev/shm (a ramdisk) to see if this holds true.
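
For reference, the test invocation looks roughly like this (a sketch; the read files and output directory are placeholders):

spades.py --meta -1 R1.fastq.gz -2 R2.fastq.gz -o assembly_out --tmp-dir /dev/shm/spades_tmp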

[UPDATE] Related to this: is it possible to run --only-error-correction and, after completion, run --continue on the same output folder? If the final output is then identical, it would allow executing the error correction on a node/reservation with many CPUs (say, 100) but not a lot of RAM, and then continuing with the assembly on a node/reservation with maybe not so many CPUs (say, 40).

Looking forward to your input and thanks for all the support!

Best wishes and stay safe,

Cedric

claczny avatar May 05 '23 09:05 claczny

Somewhat in line with #152 (comment), I was wondering, @asl: is the error correction step deterministic? If so, can it be performed on the individual reads of a pair separately? And if so, can the error-corrected reads then be used as input together with --only-assembler, yielding the same results as a "regular" run of SPAdes?

Certainly not, as the read error correction routine uses information from other reads to perform the correction. If you have 1000 reads, you can extract some information and perform such a correction. If you split them into 100 pieces of 10 reads each, then... you're out of luck, and the results ought to be suboptimal.

Related to this, is it possible to run --only-error-correction and, after completion, run --continue on the same output folder?

Sure, this would work.
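
For concreteness, a sketch of such a two-stage run (the library flags and output directory are placeholders; the second command is launched later on the higher-memory node):

spades.py --meta -1 R1.fastq.gz -2 R2.fastq.gz --only-error-correction -o out_dir -t 100   # high-CPU, low-RAM node
spades.py --continue -o out_dir   # resumes with the options from the original run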

asl avatar May 05 '23 14:05 asl

@asl Thanks a lot for your reply and the clarification.

claczny avatar May 08 '23 08:05 claczny