miniprot
Filter out very long intron gene (max intron size)
Hi, @lh3
There seem to be a lot of long-intron protein mappings in the miniprot result. Can we use some parameter to filter these out? The introns are smaller than the default -G of 200k. Is miniprot confused by the context of different genes?
[image]
Liftoff is the evidence from the existing annotation.
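Until there is a dedicated option, the intron lengths implied by the GFF3 output can be checked after the fact. The following is only a rough sketch, not part of miniprot: it assumes each alignment is an mRNA line with an ID attribute and CDS lines pointing to it through Parent, and it treats every gap between consecutive CDS features as an intron. The function names and the `max_intron` cutoff are hypothetical.

```python
from collections import defaultdict


def parse_attrs(field):
    """Turn 'ID=MP1;Rank=1' into {'ID': 'MP1', 'Rank': '1'}."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def max_intron_per_mrna(gff_lines):
    """Return {mRNA ID: longest gap between consecutive CDS features}."""
    cds = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        parent = parse_attrs(cols[8]).get("Parent")
        if parent is not None:
            cds[parent].append((int(cols[3]), int(cols[4])))
    longest = {}
    for parent, intervals in cds.items():
        intervals.sort()
        # Assumption: a gap between neighbouring CDS features is an intron.
        gaps = [b[0] - a[1] - 1 for a, b in zip(intervals, intervals[1:])]
        longest[parent] = max(gaps, default=0)
    return longest


def drop_long_intron_hits(gff_lines, max_intron):
    """Drop every feature line of alignments whose longest intron exceeds max_intron."""
    gff_lines = list(gff_lines)
    longest = max_intron_per_mrna(gff_lines)
    bad = {mrna for mrna, size in longest.items() if size > max_intron}
    kept = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(cols) >= 9:
            attrs = parse_attrs(cols[8])
            if attrs.get("ID") in bad or attrs.get("Parent") in bad:
                continue
        kept.append(line)  # comment and header lines are passed through unchanged
    return kept
```

Dropping a whole alignment is blunt: a hit with one oversized intron may still contain correctly placed exons, so it may be worth reviewing the flagged IDs before discarding them.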
Terminal exons are problematic as they are sometimes too short to be aligned accurately. It is not possible to get high sensitivity and high specificity at the same time based on a single protein. You may filter out a terminal exon with a low alignment score, but you will end up with an incomplete CDS.
For the purpose of gene prediction, you need to integrate signals from multiple proteins. You can choose the best alignment for each protein (at the cost of missing gene duplications). When there are multiple hits in a region, choose the hit with a better score or at higher identity.
In general, it is not advised to take raw protein alignment as the final annotation, just as we have to run something like stringtie to annotate a genome from RNA-seq read alignment.
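The "best alignment for each protein" step mentioned above could be sketched as follows. This is an illustration under assumptions, not a miniprot feature: it assumes mRNA lines carry the alignment score in the 6th column, an Identity attribute, and a Target attribute whose first token is the protein name. The function names are hypothetical.

```python
def mrna_records(gff_lines):
    """Yield one dict per mRNA line of a GFF3 file."""
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        yield {
            "id": attrs.get("ID"),
            # Assumption: Target looks like "<protein> <start> <end>".
            "protein": attrs.get("Target", "").split(" ")[0],
            "seqid": cols[0],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "score": float(cols[5]) if cols[5] != "." else 0.0,
            "identity": float(attrs.get("Identity", "0")),
        }


def best_hit_per_protein(gff_lines):
    """Keep the highest-scoring hit per protein; ties are broken by identity."""
    best = {}
    for rec in mrna_records(gff_lines):
        cur = best.get(rec["protein"])
        if cur is None or (rec["score"], rec["identity"]) > (cur["score"], cur["identity"]):
            best[rec["protein"]] = rec
    return best
```

As noted above, keeping a single hit per protein discards gene duplications, so this is a choice for specificity over sensitivity.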
Thanks for the prompt reply and the good advice. I will add an extra filtering step on the miniprot GFF3. I totally agree with you about the annotation step. Cross-species protein alignment gives signals for isoforms expressed only under specific conditions, which matters given the cost of comprehensive RNA-Seq. I just need a protein alignment layer with a good tradeoff between specificity and sensitivity.
I will keep this issue open. Many users will probably have a similar question. I do need to tune the parameters for terminal exons more carefully in the future. I am also thinking of writing a tool for filtering, but that won't happen soon.
By the way, what query proteins were you using? How many proteins?
For reference, the target was a hifiasm-based Arabidopsis thaliana assembly. The query was taken from a TE annotation tool, which uses it to filter false TEs using protein-coding genes (https://github.com/oushujun/EDTA/blob/master/database/alluniRefprexp082813). It consists of 102,447 proteins from different plants. Another protein dataset I typically use is the plant subset of Swiss-Prot (~40,000 reviewed entries).
I see. If there are multiple proteins mapped to the same locus, you may filter out the proteins with a lower alignment score (6th column in GFF) or a lower identity (the Identity tag). Distant proteins are harder to map correctly.
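One possible reading of this advice is a greedy pass over the mRNA lines: rank hits by score (or identity) and keep a hit only if it does not overlap an already kept hit on the same sequence and strand. The sketch below assumes the score is in the 6th column and that Identity is an attribute of the mRNA line; the overlap rule and the function names are my own assumptions, not something miniprot provides.

```python
def read_mrnas(gff_lines):
    """Collect mRNA lines of a GFF3 file as dicts."""
    hits = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "mRNA":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        hits.append({
            "id": attrs.get("ID"),
            "seqid": cols[0],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "score": float(cols[5]) if cols[5] != "." else 0.0,
            "identity": float(attrs.get("Identity", "0")),
        })
    return hits


def pick_locus_winners(hits, by="score"):
    """Greedy filter: best hit first, drop any hit overlapping a kept one."""
    kept = []
    for h in sorted(hits, key=lambda h: h[by], reverse=True):
        clash = any(
            k["seqid"] == h["seqid"]
            and k["strand"] == h["strand"]
            and h["start"] <= k["end"]
            and k["start"] <= h["end"]
            for k in kept
        )
        if not clash:
            kept.append(h)
    return kept
```

Passing `by="identity"` ranks by the Identity tag instead of the score. A caveat of any overlap-based rule: a single hit that spans two genes through a long intron can suppress the hits beneath it, so the span of each winner is worth checking as well.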
Great!! I will filter by that tag. But another issue remains. Because annotation quality varies between the assemblies of different species, it's hard to trade off between the closest protein and the best-quality protein. So I prefer to use several protein datasets, or manually reviewed proteins, as evidence for annotation.
What level of divergence would you recommend as the limit miniprot can handle? Or maybe add some presets, like minimap2's asm5/asm10/asm20, to change the threshold for mapping divergent proteins?