dashing2

Amino acid distance to AAI?

Open jianshu93 opened this issue 2 years ago • 2 comments

Hello Daniel,

For the nucleotide Jaccard index estimated by MinHash (e.g. probminhash), we can follow the Mash paper and apply a log transformation, D = -1/k * log(2J/(1+J)), to approximate ANI (as 1 - D). What if it is the Jaccard index of amino acid/protein sequences? Should we make some adjustment to it to approximate AAI (average amino acid identity)?

Thanks,

Jianshu

jianshu93, Jun 07 '22 14:06

Hi Jianshu,

You should be able to use the same equation that converts k-mer similarity (Jaccard) to ANI for AAI as well, substituting the relevant statistics: the Jaccard computed on amino acid k-mers and the amino acid k-mer length k.

Specifically:

1 + log(2*J/(1+J)) / k
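
For context, here is a short sketch of where this expression comes from, following the reasoning of the Mash paper as I understand it. The symbols n (approximate number of k-mers per sequence), w (number of shared k-mers) and d (per-site substitution rate) are introduced here for illustration, and the derivation assumes independent substitutions and two sequences of similar size:

```latex
\begin{align*}
J &= \frac{w}{2n - w}
  &&\Longrightarrow\quad \frac{w}{n} = \frac{2J}{1 + J} \\
\frac{w}{n} &\approx e^{-kd}
  &&\Longrightarrow\quad d \approx -\frac{1}{k}\,\ln\frac{2J}{1 + J} \\
\text{identity} &\approx 1 - d = 1 + \frac{1}{k}\,\ln\frac{2J}{1 + J}
\end{align*}
```

Nothing in these steps depends on the alphabet, which is why the same form carries over to amino acid k-mers; how well the independence assumption holds for proteins is a separate question.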

For Python code, you might perform something like:

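import numpy as np

k = 7  # placeholder: the amino acid k-mer length used when sketching
amino_jaccards = np.array([0.9, 0.5, 0.1])  # placeholder: set the vector of Jaccard similarities, parsing or otherwise
est_amino_identity = 1. + np.log(2 * amino_jaccards / (1. + amino_jaccards)) / k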
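
One practical wrinkle with the one-liner: a Jaccard of exactly 0 gives log(0) = -inf, and very small values give negative identities. A minimal sketch of a wrapper that handles this is below. The function name `jaccard_to_identity`, the choice to clip into [0, 1], and the example inputs are assumptions made for illustration, not something dashing2 provides:

```python
import numpy as np


def jaccard_to_identity(jaccard, k):
    """Mash-style estimate 1 + ln(2J / (1 + J)) / k, clipped to [0, 1]."""
    j = np.asarray(jaccard, dtype=float)
    if np.any((j < 0) | (j > 1)):
        raise ValueError("Jaccard values must lie in [0, 1]")
    with np.errstate(divide="ignore"):
        # J == 0 gives log(0) = -inf; the clip below maps it to identity 0.
        identity = 1.0 + np.log(2.0 * j / (1.0 + j)) / k
    return np.clip(identity, 0.0, 1.0)


# Hypothetical inputs: amino acid k-mer length 7 and made-up Jaccard values.
est = jaccard_to_identity([1.0, 0.5, 0.0], k=7)
```

Whether to clip or to report such pairs as "too distant to estimate" is a design choice; clipping simply keeps downstream code free of infinities.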

This transformation is really all you need. Also, in my experiments, weighted Jaccard (probminhash or bagminhash) can yield somewhat more accurate ANI estimates than set-based Jaccard (albeit at the cost of more time and memory); depending on the nature of the data, it might be worth trying the weighted extensions.
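
To make the set-based versus weighted distinction concrete, here is a small sketch that computes both exactly from k-mer counts on toy peptide strings. It does no sketching at all; probminhash and bagminhash estimate quantities of this kind without enumerating every k-mer. The helper names, the toy sequences, and the min/max definition of weighted Jaccard used here are assumptions for illustration; the thread does not say which weighted variant a given sketch estimates:

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def set_jaccard(a, b):
    """|A & B| / |A | B| over distinct k-mers; multiplicities are ignored."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def weighted_jaccard(a, b):
    """sum(min counts) / sum(max counts); repeated k-mers carry more weight."""
    keys = a.keys() | b.keys()
    denom = sum(max(a[x], b[x]) for x in keys)
    return sum(min(a[x], b[x]) for x in keys) / denom if denom else 0.0


# Toy peptide strings, made up for illustration (second copy has one V -> T change).
a = kmer_counts("MKVLAAGMKVLAAG", 3)
b = kmer_counts("MKVLAAGMKTLAAG", 3)
j_set = set_jaccard(a, b)
j_weighted = weighted_jaccard(a, b)
```

The two values differ here because the first string repeats several k-mers that the second contains only once; the set-based version cannot see that difference.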

Thanks,

Daniel

dnbaker, Jun 08 '22 17:06

Thanks Daniel. This is very helpful.

Jianshu

jianshu93, Jun 09 '22 14:06