miniprot

Filter out genes with very long introns (max intron size)

Open baozg opened this issue 2 years ago • 6 comments

Hi, @lh3

There seem to be a lot of long-intron protein mappings in the miniprot result. Is there a parameter we can use to filter these out? The introns are shorter than the default -G 200k. Is miniprot being confused by the context of different genes?

image

The Liftoff track is the evidence derived from the existing annotation.
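
For illustration, a post-hoc filter on intron length could look like the sketch below. It is not part of miniprot. It assumes the GFF3 output has `mRNA` rows carrying an `ID` attribute and `CDS` rows carrying a matching `Parent` attribute, and it treats the gap between consecutive CDS blocks as an intron. The function names and the 10,000 bp threshold are hypothetical placeholders, not recommended values.

```python
from collections import defaultdict


def attr(field, key):
    """Return the value of `key` in a GFF3 attribute column, or None."""
    for kv in field.split(";"):
        if kv.startswith(key + "="):
            return kv[len(key) + 1:]
    return None


def max_intron_by_mrna(gff_lines):
    """Map each parent mRNA ID to its longest gap between CDS blocks."""
    cds = defaultdict(list)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "CDS":
            continue
        cds[attr(cols[8], "Parent")].append((int(cols[3]), int(cols[4])))
    longest = {}
    for parent, blocks in cds.items():
        blocks.sort()
        # Coordinates are 1-based and inclusive, hence the "- 1".
        gaps = [b[0] - a[1] - 1 for a, b in zip(blocks, blocks[1:])]
        longest[parent] = max(gaps, default=0)
    return longest


def drop_long_intron_hits(gff_lines, max_intron=10000):
    """Drop every mRNA (and its child rows) with an intron above max_intron."""
    lines = list(gff_lines)
    longest = max_intron_by_mrna(lines)
    bad = {p for p, n in longest.items() if n > max_intron}
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(cols) >= 9:
            key = "ID" if cols[2] == "mRNA" else "Parent"
            if attr(cols[8], key) in bad:
                continue
        kept.append(line)
    return kept
```

Usage would be along the lines of `drop_long_intron_hits(open("miniprot.gff"), max_intron=10000)`, where the file name is only an example. A filter like this removes the whole hit rather than trimming the suspicious terminal exon, so it trades sensitivity for specificity.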

baozg avatar Sep 20 '22 13:09 baozg

Terminal exons are problematic as they are sometimes too short to be aligned accurately. It is not possible to get high sensitivity and high specificity at the same time based on a single protein. You may filter out a terminal exon with a low alignment score, but you will then end up with an incomplete CDS.

For the purpose of gene prediction, you need to integrate signals from multiple proteins. You can choose the best alignment for each protein (at the cost of missing gene duplications). When there are multiple hits in a region, choose the hit with a better score or at higher identity.
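
A minimal sketch of the "best alignment for each protein" idea, not an official miniprot tool. It assumes the GFF3 output has one `mRNA` row per alignment, with the alignment score in the 6th column, an `Identity` attribute, and a `Target` attribute whose first token is the protein name. The function names are hypothetical.

```python
def parse_attrs(field):
    """Turn 'ID=MP1;Identity=0.93;Target=P1 1 300' into a dict."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def best_hit_per_protein(gff_lines):
    """Return {protein: (score, identity, line)} for the top mRNA hit of each protein.

    Hits are ranked by score first, then by identity as a tie-breaker.
    """
    best = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attrs(cols[8])
        protein = attrs.get("Target", "").split(" ")[0]
        score = float(cols[5]) if cols[5] != "." else 0.0
        identity = float(attrs.get("Identity", 0))
        if protein not in best or (score, identity) > best[protein][:2]:
            best[protein] = (score, identity, line.rstrip("\n"))
    return best
```

As noted above, keeping a single hit per protein discards real gene duplications, so this is only appropriate when paralogs do not matter for the downstream step.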

In general, it is not advisable to take raw protein alignments as the final annotation, just as we have to run something like StringTie to annotate a genome from RNA-seq read alignments.

lh3 avatar Sep 20 '22 14:09 lh3

Thanks for the prompt reply and the good advice. I will add an extra filtering step on the miniprot GFF3. I totally agree with you about the annotation step. Cross-species protein alignment gives signals for isoforms expressed only in specific conditions, which is useful given the cost of comprehensive RNA-Seq. I just need a protein alignment layer that has a good tradeoff between specificity and sensitivity.

baozg avatar Sep 20 '22 14:09 baozg

I will keep this issue open. Probably many users will have a similar question. I do need to tune parameters more carefully for terminal exons in the future. I am also thinking of writing a tool for filtering, but that won't happen soon.

By the way, what query proteins were you using? How many proteins?

lh3 avatar Sep 20 '22 14:09 lh3

For reference, the target was a hifiasm-based Arabidopsis thaliana assembly. The query was taken from a TE annotation tool, which uses it to filter out false TEs that are actually protein-coding genes (https://github.com/oushujun/EDTA/blob/master/database/alluniRefprexp082813). It consists of 102,447 proteins from different plants. Another protein dataset I typically use is the plant part of Swiss-Prot (~40,000 reviewed entries).

baozg avatar Sep 20 '22 14:09 baozg

I see. If there are multiple proteins mapped to the same locus, you may filter out the proteins with a lower alignment score (6th column in GFF) or lower identity (the Identity tag). Distant proteins are harder to map correctly.
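
A rough sketch of that per-locus filter, under stated assumptions: a "locus" is taken to be a run of overlapping `mRNA` rows on the same contig and strand, the score is read from the 6th GFF column, and identity comes from the `Identity` attribute. The function names are hypothetical, and this overlap-based clustering is only one possible way to define a locus.

```python
def parse_mrna(line):
    """Parse one GFF3 mRNA row into a dict, or return None for other rows."""
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) < 9 or cols[2] != "mRNA":
        return None
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
    return {
        "contig": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "score": float(cols[5]),
        "strand": cols[6],
        "identity": float(attrs.get("Identity", 0)),
        "id": attrs.get("ID", ""),
    }


def best_hit_per_locus(gff_lines):
    """Keep the highest-scoring hit in each cluster of overlapping hits."""
    hits = [h for h in map(parse_mrna, gff_lines) if h is not None]
    hits.sort(key=lambda h: (h["contig"], h["strand"], h["start"]))
    kept = []
    locus_key, locus_end, best = None, -1, None
    for h in hits:
        key = (h["contig"], h["strand"])
        if key != locus_key or h["start"] > locus_end:
            # No overlap with the current cluster: close it and start a new one.
            if best is not None:
                kept.append(best)
            locus_key, locus_end, best = key, h["end"], h
        else:
            # Overlaps the current cluster: extend it and compare hits.
            locus_end = max(locus_end, h["end"])
            if (h["score"], h["identity"]) > (best["score"], best["identity"]):
                best = h
    if best is not None:
        kept.append(best)
    return kept
```

One caveat with this design: a single mis-aligned hit with a very long intron can bridge two neighbouring genes and merge them into one cluster, so it may be worth dropping hits with implausibly long introns before clustering.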

lh3 avatar Sep 21 '22 01:09 lh3

Great!! I will filter by that tag. But another issue remains. Because annotation quality varies across the assemblies of different species, it's hard to make a tradeoff between the closest protein and the best-quality protein. So I prefer to use several protein datasets, or manually reviewed proteins, as evidence for annotation.

What level of divergence would you say miniprot can handle? Or could you add some presets, like minimap2's -x asm5/asm10/asm20, to change the threshold for mapping divergent proteins?

baozg avatar Sep 21 '22 08:09 baozg