Genome coverage inflection point for assembly quality (contig number and BUSCO)

Open 000generic opened this issue 2 years ago • 4 comments

Description of bug

I am using SPAdes to assemble genomes from Illumina PE data. Every job I've run has completed without problems, but I'm finding an inflection point in genome coverage beyond which the number of contigs dramatically increases while the number of complete BUSCO genes dramatically decreases. I'm new to SPAdes and this may be a common finding, but it surprised me, and I didn't see anything about it in the current or archived issues I checked. So far I've found the same result across a series of genome assemblies for two different species in the same genus (Lottia).

My input data are Illumina PE reads (HiSeq X Ten run: 83.3M spots, 25G bases) generated this spring. My pipeline going into SPAdes drops all reads with more than 5% Ns. The expected genome size for either species is 300-350 Mb. I filter the reads to produce 5x, 10x, 20x, 25x, 30x, 35x, 40x, 50x, and all-read data sets. I then run SPAdes on each and compare statistics such as number of contigs, N50, and BUSCO scores for Eukaryota, Metazoa, and Mollusca. What I find is that:

For 5x to 25x or 30x read coverage of an estimated 300 Mb genome, the number of contigs decreases (from the 400,000s down to the 180,000s) and the number of complete BUSCOs increases (from <10% up to almost 70%).

HOWEVER

One step up in coverage, from 25x to 30x (Lottia scutum) or from 30x to 35x (Lottia digitalis), and the number of contigs explodes to ~2 million, then increases up to 5 million as more reads are added. At the same time, BUSCO scores drop precipitously to around 35% and can go lower with further increases in reads (down to the upper 20s at worst). Examples for Lottia digitalis are attached.

Genome assembly stats - Sheet1.pdf

The N50s (and auNs) are more consistent over the coverage series and closer to what I expected, but the assemblies are all fairly fragmented, so I'm not sure how useful those statistics are.
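For anyone comparing numbers, here is a minimal sketch of how N50 and auN can be computed from a list of contig lengths. The function names are illustrative only, and this is not necessarily how any particular assembly-stats tool computes them; it just follows the usual definitions (N50: the contig length at which the cumulative length of contigs sorted longest-first reaches half the assembly; auN: sum of squared lengths divided by total length).

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half
        # of the assembly length defines N50.
        if running >= half_total:
            return length
    return 0


def aun(lengths):
    """Return auN: sum(L_i^2) / sum(L_i), or 0.0 for an empty assembly."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length * length for length in lengths) / total


if __name__ == "__main__":
    # Toy contig lengths, not real data from these assemblies.
    toy = [10, 8, 6, 4, 2]
    print("N50:", n50(toy))
    print("auN:", round(aun(toy), 3))
```

Because auN weights every contig by its length, it changes smoothly as contigs are added or split, whereas N50 can jump between neighboring contig lengths.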

We sequenced wild-caught animals, so heterozygosity might be high. I guess this could cause the assembly to fragment without SPAdes assembling the haplotypes in parallel: contigs increase and BUSCO single-copy completes decrease, but BUSCO duplicates do not increase at the same time, whereas if haplotypes were co-assembled I would expect BUSCO duplicates to be high. Even if this is an effect of heterozygosity, I don't see why it should appear so abruptly, at what seems to be a kind of inflection point.

So I'm not sure what to make of it: whether informatics artifacts or the biology of the genomes are at play, or whether I've simply done something wrong in running SPAdes or in generating the read subsets. That said, the full read data sets, without any filtering for N-rich reads or for coverage, fit the general pattern, which suggests the pre-SPAdes processing is not the cause.
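To make the read-subset arithmetic explicit, here is a rough sketch of how the fraction of reads to keep for a target coverage follows from the total bases and the genome size. The function name is hypothetical, the 25G bases and 300-350 Mb figures are the approximate values quoted above, and the real fractions would depend on the exact base count remaining after the N filter.

```python
def subsample_fraction(target_coverage, genome_size_bp, total_bases):
    """Fraction of reads to keep so that kept bases / genome size ~= target coverage.

    Assumes reads are sampled uniformly, so keeping a fraction f of the
    reads keeps roughly a fraction f of the bases.
    """
    full_coverage = total_bases / genome_size_bp
    # Cannot keep more than all of the reads.
    return min(1.0, target_coverage / full_coverage)


if __name__ == "__main__":
    total_bases = 25e9  # ~25G bases, as reported for the run
    for genome_size in (300e6, 350e6):  # expected genome size range
        full = total_bases / genome_size
        print(f"genome {genome_size / 1e6:.0f} Mb: full data ~{full:.1f}x")
        for cov in (5, 10, 20, 25, 30, 35, 40, 50):
            frac = subsample_fraction(cov, genome_size, total_bases)
            print(f"  {cov}x -> keep {frac:.3f} of reads")
```

One consequence worth noting: since the genome size is only known to within 300-350 Mb, a nominal "30x" subset could correspond to a somewhat different true coverage, depending on which size estimate was used to build it.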

Any suggestions would be greatly appreciated! If there is a previous issue on this already, I apologize that I missed it.

Thank you :) Eric

spades.log

30x-spades.log 35x-spades.log

params.txt

30x-params.txt 35x-params.txt

SPAdes version

SPAdes 3.15.5

Operating System

Debian Buster v10.X. CPU Count: 64 : "GenuineIntel Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz (2 chips x 16 cores : 32 hyperthread cores)"

Python Version

Python 3.11.4

Method of SPAdes installation

Conda

No errors reported in spades.log

[X] Yes

Aug 30 '23 06:08 000generic
