Genome coverage inflection point for assembly quality (contig number and BUSCO)
Description of bug
I am using SPAdes to assemble genomes from Illumina PE data. Every job I've run completes without problems, but I'm finding an inflection point in genome coverage beyond which the number of contigs increases dramatically while the number of complete BUSCO genes decreases dramatically. I'm new to SPAdes and this may be a common finding, but it surprised me, and I didn't see anything about it in the current or archived issues I checked. So far I've found the same result across a series of genome assemblies for two different species in the same genus (Lottia).
My input data are Illumina PE reads (ILLUMINA (HiSeq X Ten) run: 83.3M spots, 25G bases) generated this spring. My pipeline going into SPAdes includes dropping all reads with more than 5% Ns. The expected genome size for either species is 300-350 Mb. I filter the reads to produce 5x, 10x, 20x, 25x, 30x, 35x, 40x, and 50x data sets, plus one with all reads. I then run SPAdes to assemble each and compare statistics such as number of contigs, N50, and BUSCO scores for Eukaryota, Metazoa, and Mollusca. What I find is the following:
From 5x up to 25x or 30x read coverage (depending on the species) of an estimated 300 Mb genome, the number of contigs decreases (from the 400,000s down to the 180,000s) and the percentage of complete BUSCOs increases (from <10% up to almost 70%)
HOWEVER
one step up in coverage, from 25x to 30x (Lottia scutum) or from 30x to 35x (Lottia digitalis), and the number of contigs explodes to ~2 million and increases up to 5 million with increasing numbers of reads. At the same time, BUSCO scores drop precipitously to around 35% and can go lower with further increases in reads (down to the upper 20s at worst). Examples for Lottia digitalis are attached.
Genome assembly stats - Sheet1.pdf
The N50s (and auNs) are more consistent, and more like what I expected, across the coverage series. But the assemblies are all fairly fragmented, so I'm not sure how useful these stats are.
We sequenced wild-caught animals, so heterozygosity might be high. I guess this could be causing the assembly to fragment without SPAdes assembling the haplotypes in parallel: contigs increase and BUSCO complete single-copy genes decrease, but BUSCO duplicates do not increase at the same time, whereas if haplotypes were co-assembled I would expect BUSCO duplicates to be high. And even if this is an effect of heterozygosity, I don't see why it should appear so abruptly, at what seems to be a kind of inflection point.
So I'm not sure what to make of it: whether informatics artifacts or the biology of the genomes are at play, or whether I've simply done something wrong in running SPAdes or in generating the read subsets. That said, the full read data sets, without any filtering for N-rich reads or for coverage, fit the general pattern, which suggests it's not the pre-SPAdes processing that I did.
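For reference, the arithmetic behind the coverage subsets described above can be sketched as follows. This is an illustration only, not the exact commands from my pipeline. It assumes a 300 Mb genome (the low end of the expected range) and uses the run totals quoted above (83.3M spots, 25G bases, i.e. roughly 300 bases per read pair). The function name `pairs_for_coverage` and the constants are hypothetical names introduced here.

```python
# Sketch: how many read pairs (and what fraction of the run) give a target coverage.
# Assumptions (not measured values): genome size of 300 Mb, and the run totals
# quoted in this report, rounded as reported.

GENOME_SIZE_BP = 300_000_000      # assumed genome size (300 Mb)
TOTAL_PAIRS = 83_300_000          # "83.3M spots"
TOTAL_BASES = 25_000_000_000      # "25G bases"


def pairs_for_coverage(target_cov, genome_size_bp, bases_per_pair):
    """Return the number of read pairs needed to reach target_cov."""
    return round(target_cov * genome_size_bp / bases_per_pair)


# Average bases contributed by one read pair (about 300, i.e. 2 x 150 bp).
bases_per_pair = TOTAL_BASES / TOTAL_PAIRS

# Coverage of the full, unfiltered data set under the assumed genome size.
full_cov = TOTAL_BASES / GENOME_SIZE_BP

for cov in (5, 10, 20, 25, 30, 35, 40, 50):
    n_pairs = pairs_for_coverage(cov, GENOME_SIZE_BP, bases_per_pair)
    fraction = n_pairs / TOTAL_PAIRS
    print(f"{cov:>3}x: {n_pairs:>11,} pairs ({fraction:.3f} of all pairs)")

print(f"all reads: about {full_cov:.0f}x")
```

If the true genome size is closer to 350 Mb, every nominal coverage level is correspondingly lower, so the exact position of the inflection point in "x" depends on this assumption.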
Any suggestions would be greatly appreciated! If there is a previous issue on this already, I apologize that I missed it.
Thank you :) Eric
spades.log
params.txt
SPAdes version
SPAdes 3.15.5
Operating System
Debian Buster v10.X. CPU Count: 64 : "GenuineIntel Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz (2 chips x 16 cores : 32 hyperthread cores)"
Python Version
Python 3.11.4
Method of SPAdes installation
Conda
No errors reported in spades.log
- [X] Yes