Medaka fails to accurately call deletions

Open BertVanm opened this issue 10 months ago • 2 comments

Describe the bug

I'm using Medaka 2.0.1 to polish a series of SARS2 genomes, but it fails to accurately call the deletions that are present. I run the following code:

```
medaka_consensus -d ../support_files/${variantreferencename}.fasta -i ../../hq_reads/${barcode}.hq.fastq.gz -m ${medakamodel} -o ${barcode}${variantreferencename}
medaka inference ${barcode}${variantreferencename}/calls_to_draft.bam ${barcode}${variantreferencename}/inference_probs.hdf --model ${medakamodel}
medaka vcf ${barcode}${variantreferencename}/inference_probs.hdf ../support_files/${variantreferencename}.fasta ${barcode}_${variantreferencename}/medaka.annotated.vcf
```

In the resulting VCFs, all SNPs seem to be called correctly, but the deletions are not. Below is a snippet of the same deletion in six different samples:

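| Sample | CHROM | POS | REF | ALT | QUAL |
| --- | --- | --- | --- | --- | --- |
| barcode01 | NC_045512.2 | 11285 | TTGTC | T | 37.733 |
| barcode02 | NC_045512.2 | 11287 | GTCTGGTTTT | G | 55.802 |
| barcode03 | NC_045512.2 | 11286 | TGTCTG | T | 40.81 |
| barcode04 | NC_045512.2 | 11285 | TTGTC | T | 33.729 |
| barcode05 | NC_045512.2 | 11285 | TTGTC | T | 49.564 |
| barcode06 | NC_045512.2 | 11287 | GTCTGGTTTT | G | 172.533 |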

Barcodes 02 and 06 are called correctly, but the other four are wrong. For example, this is what the read mapping against the medaka consensus for barcode04 looks like at this position:

[Image: read mapping against the medaka consensus for barcode04 at this position]

I see the same thing for a smaller 7-bp deletion and a larger 300-bp deletion elsewhere in the genome: in some samples they are called correctly and in others they are not. The coverage at these positions is in the thousands, so that should not be the issue. I've also tried playing around with some of medaka's settings, but to no avail.
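To make the disagreement between the six calls above concrete, here is a minimal sketch that converts each record into the span of reference bases it actually removes. It only assumes the usual VCF convention that REF and ALT share the leading anchor base; the `deletion_spans` helper name is made up for illustration and is not part of medaka.

```shell
#!/bin/sh
# The six records quoted in this issue: sample, CHROM, POS, REF, ALT, QUAL.
calls='barcode01 NC_045512.2 11285 TTGTC T 37.733
barcode02 NC_045512.2 11287 GTCTGGTTTT G 55.802
barcode03 NC_045512.2 11286 TGTCTG T 40.81
barcode04 NC_045512.2 11285 TTGTC T 33.729
barcode05 NC_045512.2 11285 TTGTC T 49.564
barcode06 NC_045512.2 11287 GTCTGGTTTT G 172.533'

# Hypothetical helper: for each deletion (REF longer than ALT), print
# sample, first deleted base, last deleted base, and deletion length.
# Assumes ALT is the shared anchor, so the deleted bases run from
# POS + len(ALT) to POS + len(REF) - 1 (1-based, inclusive).
deletion_spans() {
  awk 'length($4) > length($5) {
    start = $3 + length($5)
    end   = $3 + length($4) - 1
    printf "%s\t%d\t%d\t%d\n", $1, start, end, end - start + 1
  }'
}

printf '%s\n' "$calls" | deletion_spans
```

For these records, barcode02 and barcode06 come out as a 9-bp deletion spanning 11288-11296, barcode01, 04 and 05 as a 4-bp deletion spanning 11286-11289, and barcode03 as a 5-bp deletion spanning 11287-11291. So the calls differ in length, not only in how the same deletion is positioned.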

Is this the expected behavior for medaka? Do you have any suggestions how I might solve this?

BertVanm • Feb 28 '25 11:02

Hi all,

I have the same issue with a 30-bp deletion (roughly positions 100-130) that is clearly visible in the alignment and has very high coverage, yet gets filled in when making a consensus with medaka. I run medaka 1.12 but have also tried the latest release.

medaka 2.0:
medaka inference p27.sorted.bam output.hdf
medaka sequence output.hdf ref.fasta consensus.fasta
--> deletion not shown in consensus

medaka 1.12:
medaka consensus --model r1041_e82_400bps_hac_v5.0.0 p27.sorted.bam output.hdf
medaka variant ref.fasta output.hdf output.vcf
--> deletion not shown in vcf file

When I run 1.12 with a different model, namely r1041_e82_400bps_hac_g615, the deletion shows up for some of my BAM files but not for p27. However, I ran basecalling with r1041_e82_400bps_hac_v5.0.0, and I'm not sure what the g615 means.
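One quick way to check whether a deletion of this size made it into a consensus is to compare total sequence lengths. The sketch below assumes the file names `ref.fasta` and `consensus.fasta` from the commands above; `fasta_length` is a hypothetical helper, not a medaka tool. Any other indels in the consensus also shift the difference, so treat it as a rough indication only.

```shell
#!/bin/sh
# Hypothetical helper: total number of bases in a FASTA file.
# Header lines (starting with ">") are skipped; a trailing CR is stripped.
fasta_length() {
  awk '!/^>/ { sub(/\r$/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Only run the comparison if both (assumed) files are present.
if [ -f ref.fasta ] && [ -f consensus.fasta ]; then
  ref_len=$(fasta_length ref.fasta)
  cons_len=$(fasta_length consensus.fasta)
  echo "reference=$ref_len consensus=$cons_len difference=$((ref_len - cons_len))"
fi
```

If the 30-bp deletion were carried into the consensus, the difference would be expected to be around 30; a difference near 0 fits the deletion having been filled in.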

Happy to share files, but GitHub won't allow it.

thanks

Phil

PhilliVanilli • Mar 06 '25 09:03

My data was basecalled with Dorado version 0.9.1 using the super-accurate (sup) model, on a FLO-MIN114 (R10.4.1) flow cell. What command should I use to run medaka? Or can I no longer use it in 2025? My genome is 40 Mb and it's a fungus.

Agridibuu • Apr 29 '25 06:04