Decreased N50 with higher sequencing depth

Open nadegeguiglielmoni opened this issue 4 years ago • 10 comments

Hello,

I have been running some tests with NextDenovo 2.2 on one genome for which I have high-coverage PacBio and Nanopore reads. For each dataset separately, I tried subsampling the reads to different sequencing depths (10X, 20X... 100X). I found that the N50 was highest at 40-50X, and that it decreased at higher sequencing depths. As the species is diploid with variable levels of heterozygosity, including some highly heterozygous regions, my hypothesis is that a higher sequencing depth gives more support to alternative haplotypes and leads to breaks in the assembly. Could you give me some insights?

nadegeguiglielmoni avatar Mar 02 '21 11:03 nadegeguiglielmoni
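
For readers unfamiliar with the metric discussed in this thread, here is a minimal sketch of how N50 is conventionally computed from contig lengths. The function name `n50` and the toy contig lengths are illustrative assumptions, not values from this issue. The example shows why breaking contigs lowers N50 even when the total assembly size is unchanged.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running sum of
    contig lengths, sorted longest first, reaches half of the total size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


if __name__ == "__main__":
    # Toy lengths (hypothetical): both assemblies total 100 units.
    contiguous = [40, 30, 20, 10]
    fragmented = [40, 15, 15, 10, 10, 10]  # same total, more breaks
    print(n50(contiguous), n50(fragmented))
```
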

Hi, could you provide your config files? BTW, you should update to the latest version.

moold avatar Mar 03 '21 00:03 moold

We have updated NextDenovo for future projects.

Here is the config file:

[General]
job_type = local
job_prefix = ND_ont
task = assemble # 'all', 'correct', 'assemble'
rewrite = yes # yes/no
deltmp = yes
rerun = 10
parallel_jobs = 10
input_type = raw
input_fofn = ./input.fofn
workdir = ./run

[assemble_option]
minimap2_options_raw = -x ava-ont -t 10
random_round = 20
minimap2_options_cns = -x ava-ont -t 8 -k17 -w17
nextgraph_options = -a 1
seed_cutoff = HereSeedCutoff

nadegeguiglielmoni avatar Mar 03 '21 14:03 nadegeguiglielmoni

How about the seed_cutoff value for different depths?

moold avatar Mar 03 '21 15:03 moold

We set it to 1001.

nadegeguiglielmoni avatar Mar 03 '21 16:03 nadegeguiglielmoni

OK, I think this may be the core of the problem. You can try calculating the seed_cutoff value using bin/seq_stat; see #103. Usually, assembly quality is affected by read length, not depth.

moold avatar Mar 04 '21 01:03 moold
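
To make the suggestion above more concrete, here is a rough sketch of one common way to derive a length cutoff from the data: pick the shortest read length such that all reads at least that long add up to a target seed depth. This is an assumption about the general idea only, not the actual implementation of bin/seq_stat. The function name `estimate_seed_cutoff` and its parameters are hypothetical. Note that with a fixed cutoff such as 1001, the amount of data used as seeds grows with every increase in subsampled depth, whereas a depth-based cutoff keeps it roughly constant.

```python
def estimate_seed_cutoff(read_lengths, genome_size, seed_depth, min_length=0):
    """Return the smallest read length L such that reads with length >= L
    sum to at least genome_size * seed_depth bases.

    Hypothetical helper for illustration; not NextDenovo's own code.
    If the data cannot reach the target, the shortest kept length is returned.
    """
    target = genome_size * seed_depth
    kept = sorted((n for n in read_lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no reads left after length filtering")
    running = 0
    for length in kept:
        running += length
        if running >= target:
            return length
    return kept[-1]


if __name__ == "__main__":
    # Toy read lengths and genome size (hypothetical values).
    reads = [10, 9, 8, 7, 6, 5, 4, 3]
    print(estimate_seed_cutoff(reads, genome_size=10, seed_depth=3))
```

In practice the cutoff would be recomputed for each subsampled dataset, since the read length distribution and total yield both change with depth.
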

Ok thank you, I will try optimizing the seed cutoffs.

nadegeguiglielmoni avatar Mar 04 '21 10:03 nadegeguiglielmoni

Hello,

We ran the assemblies again with better-adapted seed cutoffs. For PacBio assemblies, there is little change. For Nanopore assemblies, there is still a drop in N50 at 60X. The N50 is better for assemblies at 80X and 100X, but the BUSCO score is drastically decreased compared to previous assemblies.

nadegeguiglielmoni avatar Mar 08 '21 10:03 nadegeguiglielmoni

Thanks for your feedback. Assembly quality is not simply linear with the depth and length of the input data; it also depends on the characteristics of the genome. However, the BUSCO score should be similar, so could you share more details (assembly options and BUSCO values) about how the BUSCO score drastically decreased compared to previous assemblies?

moold avatar Mar 09 '21 01:03 moold

Hello,

The parameters were the same as before, except for seed cutoff.

Here are the results I had before with Nanopore reads:

40X: N50 = 11.5-14.5 Mb, single BUSCOs = 312-388, duplicated BUSCOs = 12-24
50X: N50 = 11.0-13.8 Mb, single BUSCOs = 362-393, duplicated BUSCOs = 14-27
60X: N50 = 4.7-8.1 Mb, single BUSCOs = 668-685, duplicated BUSCOs = 79-98
80X: N50 = 4.1-10.1 Mb, single BUSCOs = 665-695, duplicated BUSCOs = 78-91
100X: N50 = 2.6-7.0 Mb, single BUSCOs = 663-683, duplicated BUSCOs = 80-105

And here are the results with an "improved" seed cutoff:

40X: N50 = 11.6-14.7 Mb, single BUSCOs = 319-392, duplicated BUSCOs = 12-24
50X: N50 = 10.8-14.8 Mb, single BUSCOs = 348-386, duplicated BUSCOs = 19-23
60X: N50 = 6.0-8.8 Mb, single BUSCOs = 674-694, duplicated BUSCOs = 72-87
80X: N50 = 10.0-13.7 Mb, single BUSCOs = 362-398, duplicated BUSCOs = 19-31
100X: N50 = 10.7-12.4 Mb, single BUSCOs = 404-420, duplicated BUSCOs = 25-43

nadegeguiglielmoni avatar Mar 09 '21 11:03 nadegeguiglielmoni

Hi, could you provide the estimated genome size and the assembly size? Did you subsample the reads randomly, or did you just select the longest reads?

moold avatar Mar 11 '21 06:03 moold
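
The distinction raised in the last question matters because the two strategies produce very different read length distributions at the same nominal depth. Below is a minimal sketch contrasting them. It operates on read lengths only, and the function names (`subsample_random`, `subsample_longest`) and toy values are hypothetical; it does not reflect how the reads in this issue were actually subsampled.

```python
import random


def _take_until(lengths, target_bases):
    """Collect lengths in the given order until target_bases is reached."""
    picked, total = [], 0
    for length in lengths:
        if total >= target_bases:
            break
        picked.append(length)
        total += length
    return picked


def subsample_random(read_lengths, genome_size, depth, seed=0):
    """Pick reads in random order until genome_size * depth bases are reached."""
    shuffled = list(read_lengths)
    random.Random(seed).shuffle(shuffled)
    return _take_until(shuffled, genome_size * depth)


def subsample_longest(read_lengths, genome_size, depth):
    """Pick the longest reads first until genome_size * depth bases are reached."""
    return _take_until(sorted(read_lengths, reverse=True), genome_size * depth)


if __name__ == "__main__":
    # Toy read lengths and genome size (hypothetical values).
    lengths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(subsample_longest(lengths, genome_size=5, depth=3))
    print(subsample_random(lengths, genome_size=5, depth=3))
```

Selecting the longest reads raises the mean read length of the subsample, while random subsampling roughly preserves the original length distribution, so the two approaches are not directly comparable when studying the effect of depth alone.
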