medaka icon indicating copy to clipboard operation
medaka copied to clipboard

Reduced QV score and completeness after Medaka

Open EinarBaldvin opened this issue 2 years ago • 9 comments

Hi, I am not having any luck with Medaka polishing of my genome, it is a plant genome of around ~4 Gbp.

Describe the bug After running Medaka I am getting reduced completeness and reduced QV score or a minor increase.

I have two assemblies, one from shasta (shasta-2021-06-21) and one from Flye 2.9 (--nano-hq). The data is from both MinION and PromethION R9.4.1 and basecalled with Guppy 5.0.7 Super-Accuracy.

Mapping with Minimap2.22, all nanopore data used. minimap2 -ax map-ont -I 100G -K 24G -t $NCPUS --secondary=no ${dir}.fasta $fq > ${dir}_all2asm_SecNo.sam

Using Racon v1.4.21, all nanopore data used. racon -m 8 -x -6 -g -8 -w 500 -t $NCPUS --cudapoa-batches 50 --cudaaligner-batches 10 ${fq} ${map} ${dir}.fasta > ${dir}_1x_racon.fasta

I ran Medaka 1.4.2 in individual steps.

Align, with Minimap2.17. Only using the data from PromethION ~50x coverage. I also tried Minimap2.22 in this step. mini_align -i ${fq} -r ${asm} -m -p ${dir} -t ${NCPUS}

Consensus: medaka consensus $bam results/${out}.hdf --model r941_prom_sup_g507 --batch 500 --threads ${NCPUS}

Stitching: We had problems with the stitching but fixed it with #310 medaka stitch results/$hdf $asm $out --threads ${NCPUS} medaka stitch results/$hdf $asm ${out%%.fasta}_nofill.fasta --no-fillgaps --threads ${NCPUS}

Results

QV score

==> her_flye2.9_G5_minovlp_10k_1x_racon_and_medaka.qv <== her_flye2.9_G5_minovlp_10k 492085386 3917915071 21.958 0.00637086 her_flye2.9_G5_minovlp_10k_1x_racon 475153585 3915363391 22.117 0.00614181 her_flye2.9_G5_minovlp_10k_1x_racon_medaka 474187075 3912668454 22.1233 0.00613301 her_flye2.9_G5_minovlp_10k_1x_racon_MM2.22_medaka 474530769 3910670094 22.1176 0.00614107 her_flye2.9_G5_minovlp_10k_1x_racon_medaka_nofill 473394804 3905747423 22.1228 0.00613365

==> her_shasta_G5_k12_1x_racon.qv <== her_shasta_G5_k12 648746002 3717043782 20.4134 0.00909202 her_shasta_G5_k12_1x_racon 470097317 3719917390 21.9295 0.00641277 her_shasta_G5_k12_1x_racon_MM2.22_medaka 463572650 3720302866 21.9948 0.00631717 her_shasta_G5_k12_1x_racon_MM2.22_medaka_nofill 463534849 3719251259 21.9938 0.00631853

Completeness

==> her_flye2.9_G5_minovlp_10k_1x_racon.completeness.stats <== her_flye2.9_G5_minovlp_10k all 969948174 1271807996 76.2653 her_flye2.9_G5_minovlp_10k_1x_racon all 988166632 1271807996 77.6978 her_flye2.9_G5_minovlp_10k_1x_racon_medaka all 984839296 1271807996 77.4362 her_flye2.9_G5_minovlp_10k_1x_racon_MM2.22_medaka all 984690885 1271807996 77.4245 her_flye2.9_G5_minovlp_10k_1x_racon_medaka_nofill all 984748927 1271807996 77.4291

==> her_shasta_G5_k12_1x_racon.completeness.stats <== her_shasta_G5_k12 all 937626837 1271807996 73.7239 her_shasta_G5_k12_1x_racon all 974819462 1271807996 76.6483 her_shasta_G5_k12_1x_racon_MM2.22_medaka all 972681195 1271807996 76.4802 her_shasta_G5_k12_1x_racon_MM2.22_medaka_nofill all 972675710 1271807996 76.4798

Logging Please let me know which additional logs you would like to receive.

Environment:

  • Installation method : Source
  • OS: CentOS 7.9.2009
  • medaka version : Medaka 1.4.2
  • Python: 3.6.5
  • GPU model: Nvidia Quadro RTX 8000
  • Nvidia driver version: 450.142.00
  • CUDA version: CUDA Version: 11.0
  • cuDNN version: cuDNN 8.1.1

Any suggestions would be very welcome. Cheers, Einar

EinarBaldvin avatar Sep 17 '21 13:09 EinarBaldvin

Hi @EinarBaldvin,

Thank you for filling in the issue template, it really helps to diagnose issues.

Can you confirm how you have calculated your accuracy and completeness statistics? The completeness looks rather low; I would be more concerned with that in the first instance. Investigating and fixing the cause there will potentially explain the low apparent accuracy also.

I need to confer with colleagues who have more experience sequencing and assembling plant genomes to provide more guidance.

cjw85 avatar Sep 17 '21 13:09 cjw85

Thank you for the quick response.

I used meryl from the Merqury pipeline: merqury.sh Her.illumina.meryl $asm1 $asm2 ${out} > ${out}.log

The database is created from illumina reads (10x Genomics) from the same individual. As a "baseline" I ran meryl against an assembly I got out of supernova (from 10x Genomics) and received the following stats:

her2_hap 321333 3372434466 53.4319 4.53745e-06 her2_hap all 1174937564 1271807996 92.3833

I am also concerned about the low completeness, but busco score is ~99%. It is a diploid and the genome contains ~75% repetitive sequences.

EinarBaldvin avatar Sep 17 '21 13:09 EinarBaldvin

@EinarBaldvin do you know how heterozygous this genome is?

jts avatar Sep 17 '21 16:09 jts

It should be very low heterozygosity, 21-mer counting gives me an estimate of 0.2% with findGSE and supernova reports it at 0.01% (1 SNP every 10.3 Kbp, this one is a bit hard to believe though).

EinarBaldvin avatar Sep 17 '21 22:09 EinarBaldvin

@jts @cjw85 after polishing with short reads and hapog the completeness fixes itself somewhat, which I guess correlates directly with error rate. However, I would be very interested to get a higher initial QV before the short read polishing.

her_flye2.9_G5_minovlp_10k_1x_racon_hapog 45753500 3917735926 32.524 0.000559238 her_flye2.9_G5_minovlp_10k_1x_racon_hapog all 1201962449 1271807996 94.5082

EinarBaldvin avatar Sep 20 '21 08:09 EinarBaldvin

@EinarBaldvin your heterozygosity numbers aren't far off numbers for human genomes, which can be assembled fairly straight-forwardly with flye to obtain higher accuracies without further polishing work. Again I don't know how you are measuring it but your "75% repetitive sequence" figure sounds high.

I think the place to start is tweaking parameters in flye in the hope of improving the foundation -- use of racon and medaka on top of the assembly assume the assembly is structurally unambiguous, and I don't think that is the case here.

cjw85 avatar Sep 20 '21 08:09 cjw85

@cjw85 as for the high repetitive sequences, this is quite normal for these families of plants and 75% is actually quite average to low. The general range is between 80-90%, https://academic.oup.com/plcell/article/33/6/1888/6169005. I measured it using first jellyfish for the 21-mer count and then calculated features with findGSE and genomescope. These metrics fit well with published genome size flow cytometry and characteristics of related species.

Flye and shasta are both giving me fairly similar assemblies, there is a slight improvement with Racon in QV score and a more pronounced jump for shasta than flye. However, medaka is only giving me a similar improvement as a second round of Racon would or worse.

Assembly stats for the Flye assembly, shasta is almost identical. Which give me hope that he foundation is "correct" as these assembler use quite different approaches. Size 3.9 Gbp out of est. 4.4 Gbp (difference in size is most likely long repeats that couldn't be bridged yet) N50 15417077 N75 6965315 L50 65 L75 158

Any suggestions on the flye parameters or ways for me to evaluate the foundation better? I am now using the nano-hq option and the contigs I feed into Racon are the filtered_contigs.fasta after one round of flye internal polishing.

EinarBaldvin avatar Sep 20 '21 09:09 EinarBaldvin

@cjw85 are there any parameters I can change in medaka to make it able polish my genome? Could it be that medaka has problems with highly repetitive and/or plant genomes? As far as I can tell the foundation is as complete as I can get it; 100-250 contigs per chromosome, each chr is ~630 Mbp, 100% BUSCO scores.

EinarBaldvin avatar Sep 24 '21 06:09 EinarBaldvin

Apologies I neglected a response here. Have you managed to improve your assemblies in the mean time?

I think the central issue here is similar to that I just expressed here: https://github.com/nanoporetech/medaka/issues/330#issuecomment-982511388

For your genome with 75% repetitive sequence, alignment of the reads back to the draft assembly could be ambiguous. The could lead to either a lack of data for polishing (though medaka is fairly promiscuous in the MAPQ of reads that it uses), or erroneous alignment to incorrect copies of repeats leading to a poorly defined consensus.

cjw85 avatar Nov 30 '21 11:11 cjw85