metaviral fails without error notification
I'm running metaviralSPAdes on low-complexity samples (2-5 viruses). It works for most of them, but three consistently fail without warnings. The log file reports no errors, and scaffolds.fasta and contigs.fasta are even produced, but they are empty (size zero). I had previously recovered complete viral genomes (19 kbp) with SPAdes 3.13 from these very same samples, but with v3.15, even using normal SPAdes, I get hundreds of scaffolds <3000 bp.
In the output I only find assembly_graph_after_simplification.gfa, but none of the final assembly graph files. In the K99 directory I find the edges_before_XXX.fasta files, but all the components and final_contigs files are empty, so I guess something is failing at this step.
Cheers!
Tagging metaviralSPAdes' author for the troubleshooting @Dmitry-Antipov
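Since the run finishes without an error in the log, the symptom described above is easiest to catch by checking the output files directly. Below is a minimal sketch, not part of SPAdes: the function name `empty_outputs` is hypothetical, and it assumes that contigs.fasta and scaffolds.fasta sit at the top level of each output directory.

```python
from pathlib import Path

# File names taken from the report above; their location at the top level
# of the output directory is an assumption.
EXPECTED_OUTPUTS = ("contigs.fasta", "scaffolds.fasta")


def empty_outputs(outdir):
    """Return the expected output files that are missing or zero-sized."""
    outdir = Path(outdir)
    problems = []
    for name in EXPECTED_OUTPUTS:
        path = outdir / name
        # A missing file and an empty file are both treated as a failed run.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems
```

Calling `empty_outputs("sample_01")` (a hypothetical directory name) returns an empty list for a run that produced both files with content, and the names of the offending files otherwise.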
Hi. Could you please send us the spades.log file (either to [email protected] or as an attachment here)?
Thanks. Here it is. spades.log Cheers!
I have run into the same issue as @geboro. I tested several SPAdes modes on 25 isolate MiSeq viral libraries. All of them produce a contigs.fasta file except when the --metaviral flag is used, in which case the file is empty for several samples. SPAdes runs to completion without errors. There is a warning about the insert length, but it appears in the log files of successful runs as well. spades.log
Hi. It is actually normal for metaviralSPAdes not to detect any viral-like contigs in samples that contain no circular (or specific linear) paths meeting certain coverage and length conditions, but this should not happen for isolate viral libraries. Is it possible that these libraries contain quasispecies or groups of related species? This can be seen by looking at (or sending us) the graph before all metaviral procedures: .../gdFB431/K127/assembly_graph_after_simplification.gfa
In this specific case you have very high average coverage - that may also prevent viralSPAdes from finding complete viruses in the data.
I don't think the high coverage is a problem. Many other libraries had similar coverage (500-1000x) and finished without issues.
I've attached the assembly graph file. I'd be very interested in determining if this (or other) libraries contained closely related, but distinct viral strains/species.
Update: I took a look at the assembly_graph_after_simplification.gfa files. In the two libraries that failed to yield a finished assembly, there was a low ratio segments (S) to links (L) (mean=4.25) relative to the rest of the libraries (mean=38x). I think this answers the question and indicates that there was strain variation in these two libraries, but I'd welcome any insights you may have.
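For anyone wanting to reproduce the ratio mentioned above: in a GFA file, every record is one tab-separated line whose first field is the record type, `S` for a segment and `L` for a link. The following is a minimal sketch that counts both; the function name `segment_link_ratio` is hypothetical, and the ratio is only a rough heuristic, not something SPAdes reports.

```python
def segment_link_ratio(gfa_lines):
    """Count segment (S) and link (L) records and return (S, L, S/L)."""
    segments = 0
    links = 0
    for line in gfa_lines:
        if line.startswith("S\t"):
            segments += 1
        elif line.startswith("L\t"):
            links += 1
    # A graph without links has no defined ratio; report infinity instead.
    ratio = segments / links if links else float("inf")
    return segments, links, ratio
```

An open file handle can be passed directly, for example `segment_link_ratio(open("assembly_graph_after_simplification.gfa"))`.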
Yes, this looks like a case with multiple strains - we can see three bulges of similar length and a complex region
With lower coverage metaviralSPAdes could output one of these strains (the one with higher coverage), but the coverage is too high: metaviralSPAdes has a 600x cutoff for its edge removal procedure.
As for the segment-to-link ratio, it probably reflects mostly the low-coverage trash contigs (which were removed from the picture above): there are many isolated low-coverage trash contigs, and the higher the dataset coverage, the more of them there will be.
Thanks, this is resolved as far as I'm concerned. I might suggest adding a warning or something to the log file that indicates why no final assembly is output. That might help future users.
Also, if you have any pointers for extracting this information (number and size of bulges) from the assembly graph, that would be great. The goal is to flag assemblies that might contain multiple strains.
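One possible way to approach the bulge question is to treat a bulge as a set of two or more segments that share exactly the same predecessors and the same successors in the graph, i.e. parallel alternative paths between the same two junctions. The sketch below works under that assumption; it is not how SPAdes itself defines or removes bulges, the function name `find_bulges` is hypothetical, and it ignores loops and more complex regions. Segment length is taken from the sequence field of each `S` line.

```python
from collections import defaultdict

FLIP = {"+": "-", "-": "+"}


def find_bulges(gfa_lines):
    """Return groups of parallel segments as sorted lists of (name, length)."""
    lengths = {}
    succ = defaultdict(set)
    pred = defaultdict(set)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            lengths[fields[1]] = len(fields[2])
        elif fields[0] == "L":
            a = (fields[1], fields[2])
            b = (fields[3], fields[4])
            # Each link a -> b also implies the reverse-complement link
            # flip(b) -> flip(a).
            rc = ((b[0], FLIP[b[1]]), (a[0], FLIP[a[1]]))
            for u, v in ((a, b), rc):
                succ[u].add(v)
                pred[v].add(u)

    # Group oriented segments that have identical predecessor and
    # successor sets; only segments with both are considered.
    groups = defaultdict(set)
    for node in set(succ) & set(pred):
        key = (frozenset(pred[node]), frozenset(succ[node]))
        groups[key].add(node)

    # Every bulge is seen once per strand, so deduplicate by segment names.
    bulges = set()
    for members in groups.values():
        names = frozenset(name for name, _ in members)
        if len(names) >= 2:
            bulges.add(names)
    return sorted(sorted((n, lengths.get(n, 0)) for n in names) for names in bulges)
```

The number of returned groups and the lengths inside each group could then serve as a rough flag for libraries that may contain several strains.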
Note that SPAdes 3.15.4 includes dedicated diagnostics for empty output here, so it won't come as a surprise :)
For the last assembly (shown above) I downsampled the library to 650x coverage and the program output a circular genome. However, I'm dealing with one last tricky phage library for which metaviralSPAdes won't complete even after downsampling. The coverage is ~500x and the expected genome size is ~75 kbp. Looking at the assembly graph, there are 6 small bulges (<2 kbp) that do not have abnormally high coverage.
I've attached the log and assembly graph: spades.log assembly_graph_after_simplification.gfa.zip
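For reference, the downsampling mentioned above boils down to keeping a fraction of read pairs equal to the target coverage divided by the current coverage, where coverage is total sequenced bases divided by genome size. Here is a minimal sketch of that arithmetic plus a simple random subsampler for paired FASTQ files. All function names are hypothetical, it assumes plain four-line FASTQ records with both files in the same order, and it is not the procedure used by any poster in this thread.

```python
import random
from itertools import islice


def keep_fraction(total_bases, genome_size, target_coverage):
    """Fraction of reads to keep in order to reach the target coverage."""
    current_coverage = total_bases / genome_size
    return min(1.0, target_coverage / current_coverage)


def fastq_records(handle):
    """Yield FASTQ records as lists of four lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        yield record


def subsample_pairs(r1, r2, fraction, seed=0):
    """Yield read pairs, keeping each pair with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    for rec1, rec2 in zip(fastq_records(r1), fastq_records(r2)):
        # One draw per pair keeps mates together.
        if rng.random() < fraction:
            yield rec1, rec2
```

As an example of the arithmetic, a 75 kbp genome sequenced to 75,000,000 bases has 1000x coverage, so `keep_fraction(75_000_000, 75_000, 500)` gives 0.5.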
Hello, I am new to this. Does this mean we need to downsample our data? If so, does this affect the final assembly or even the number of viral species we could find? Sorry if this is a stupid question. I would also appreciate a reference for @snayfach's statement "there was a low ratio segments (S) to links (L) (mean=4.25) relative to the rest of the libraries (mean=38x)"; I really don't understand it.