
False duplications

Open m-jahani opened this issue 2 years ago • 3 comments

Hello

Is there any way to find the coordinates of false duplications with k-mers found in unexpected copy numbers?

Thanks

m-jahani, Mar 17 '22 23:03

Hello, you could use this script in Merqury: https://github.com/marbl/merqury/blob/master/eval/false_duplications.sh

This script looks at the 1st and 2nd peaks of the spectra-cn plot and counts the k-mers that are seen more than once in the assembly. It assumes that the k-mers in the 2nd peak come from the homozygous part of the genome and are expected to be present once in the assembly.
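To make the counting idea above concrete, here is a minimal Python sketch on toy data. It is not the logic of false_duplications.sh itself: the function name, the input layout (k-mer mapped to read multiplicity and assembly copy number), and the choice of denominator are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def false_duplication_stats(
    kmers: Dict[str, Tuple[int, int]], cutoff: int
) -> Tuple[int, int, float]:
    """Toy illustration (hypothetical helper, not part of Merqury).

    kmers maps each k-mer to (read_multiplicity, assembly_copies).
    K-mers at or below the read-multiplicity cutoff are treated as
    belonging to the 1st and 2nd peaks, i.e. expected once in the assembly.
    """
    expected_single = 0  # k-mers under the cutoff that occur in the assembly
    duplicated = 0       # of those, k-mers seen more than once in the assembly
    for read_mult, asm_copies in kmers.values():
        if asm_copies == 0 or read_mult > cutoff:
            continue  # absent from the assembly, or above the 2nd peak
        expected_single += 1
        if asm_copies > 1:
            duplicated += 1
    # Assumed definition of the percentage: duplicated / all k-mers considered.
    pct = 100.0 * duplicated / expected_single if expected_single else 0.0
    return duplicated, expected_single, pct


toy = {
    "AAA": (10, 1),  # 1st peak, single copy in the assembly
    "AAC": (20, 2),  # 2nd peak, but found twice in the assembly
    "ACG": (60, 3),  # above the cutoff, ignored
    "CCC": (18, 1),  # 2nd peak, single copy
    "GGG": (5, 0),   # read-only k-mer, ignored
}
dup, total, pct = false_duplication_stats(toy, cutoff=30)
print(dup, total, round(pct, 2))
```

With a cutoff placed too low, 2nd-peak k-mers such as "AAC" would be excluded and the duplication would go uncounted, which is why the cutoff matters so much.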

I'd caution you to double-check the cutoff used (it is in the printed result) to make sure it was chosen properly given what you see in the spectra-cn histogram.
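One rough way to sanity-check a cutoff is to confirm that it lies beyond the second peak of the k-mer multiplicity histogram. The Python sketch below works on a toy histogram under stated assumptions: the function names and the `min_mult` parameter (used to skip the low-multiplicity error k-mers) are hypothetical, and real histograms are noisy enough that smoothing or simply looking at the plot is usually more reliable.

```python
from typing import Dict, List


def local_peaks(hist: Dict[int, int], min_mult: int = 5) -> List[int]:
    """Return multiplicities that are strict local maxima at or above min_mult.

    hist maps k-mer multiplicity -> number of distinct k-mers.
    Missing multiplicities are treated as a count of zero.
    """
    peaks = []
    for m in sorted(hist):
        if m < min_mult:
            continue  # skip the error k-mer region near multiplicity 1
        if hist[m] > hist.get(m - 1, 0) and hist[m] > hist.get(m + 1, 0):
            peaks.append(m)
    return peaks


def cutoff_covers_two_peaks(
    hist: Dict[int, int], cutoff: int, min_mult: int = 5
) -> bool:
    """True if at least two peaks exist and the cutoff is above the 2nd one."""
    peaks = local_peaks(hist, min_mult)
    return len(peaks) >= 2 and cutoff > peaks[1]


# Toy histogram: error k-mers at 1-3, a 1-copy peak at 10, a 2-copy peak at 20.
toy_hist = {
    1: 1000, 2: 300, 3: 50,
    8: 20, 9: 60, 10: 100, 11: 60, 12: 30,
    19: 80, 20: 150, 21: 80, 22: 20,
}
print(local_peaks(toy_hist))
print(cutoff_covers_two_peaks(toy_hist, cutoff=30))
print(cutoff_covers_two_peaks(toy_hist, cutoff=15))
```

A cutoff that fails this check would leave part of the 2-copy region out of the count.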

Best, Arang

arangrhie, Mar 24 '22 18:03

Thanks for your answer,

$MERQURY/eval/false_duplications.sh calculates the percentage of false duplications, whereas I was looking for a bed file that shows the locations of those segments.

Like what has been reported in the Merqury paper:

'Positions of k-mers for mis-assembly detection ...... In addition, the k-mers found in unexpected copy numbers (i.e., false duplications) are also provided as .bed and .tdf files......'

Thanks, Mojtaba

m-jahani, Mar 24 '22 19:03

Hello Mojtaba @m-jahani,

Apologies for the delayed reply! This totally slipped through. Thanks for reading the paper in detail!! That sentence was left in the paper by mistake (Ouch!!). Apologies again for the confusion.

The false duplication track was an early feature in Merqury, but it was removed because false duplications are difficult to quantify with simple cutoffs and depend on the assembly context. As cautioned above, check the cutoff to make sure it properly includes the 1-copy and 2-copy regions.

I have pushed a script that generates the .bed and .wig files for the k-mers reported by the false_duplications.sh script: https://github.com/marbl/merqury/blob/master/eval/false_duplications_track.sh Note that I have made the .tdf files obsolete, as that format is only supported in IGV. The .wig file is its replacement.
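For readers curious how per-k-mer hits can turn into a track, the Python sketch below merges overlapping or abutting k-mer positions into BED-style intervals (0-based, half-open). This is only an illustration on toy data: the function name and input layout are hypothetical, and whether false_duplications_track.sh merges positions this way is an assumption that has not been checked against the script.

```python
from typing import Iterable, List, Tuple


def kmer_hits_to_bed(
    hits: Iterable[Tuple[str, int]], k: int
) -> List[Tuple[str, int, int]]:
    """Merge k-mer hits into BED intervals (hypothetical helper).

    hits: (sequence name, 0-based start position of the k-mer).
    Each k-mer covers [start, start + k); overlapping or abutting
    k-mers on the same sequence are merged into one interval.
    """
    merged: List[Tuple[str, int, int]] = []
    for chrom, start in sorted(hits):
        end = start + k
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


toy_hits = [("chr1", 5), ("chr1", 6), ("chr1", 20), ("chr2", 0)]
for chrom, start, end in kmer_hits_to_bed(toy_hits, k=4):
    # BED columns are tab-separated: name, start, end
    print(f"{chrom}\t{start}\t{end}")
```

Long merged intervals are more likely to reflect a duplicated segment than isolated single k-mer hits, which can arise from the sampling effects described in this thread.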

Please note that true haplotype-specific 2-copy k-mers may be overcounted. Moreover, some k-mers that exist in the genome at higher copy numbers may have been undersampled during sequencing (which commonly happens), and may thus fall under the cutoff.

I feel this track is somewhat useful, but it has a high chance of giving confusing results in regions vulnerable to sequencing biases. There could be cleverer ways to detect false duplications, such as probabilistic models that account for the copy number context of the surrounding genome. Hope somebody comes up with a better approach!

Best, Arang

arangrhie, May 15 '22 03:05