all-vs-all kmer sharing and mash distances for a set of proteins

Open stubrown opened this issue 1 year ago • 0 comments

Hello Mash authors. I am working with Mash for protein clustering. We have a large database of 8 million proteins, and the current clustering method relies on several stages of all-vs-all BLAST comparisons.

I have implemented a Mash distance calculation to quantify the dispersion of proteins within a cluster. Right now I choose a centroid for the cluster with a complex process that relies on BLAST e-values computed elsewhere in the workflow, and then run Mash with this central protein against a FASTA file of all other proteins in the cluster. This generates a 'distance to the center' metric for each protein that is a valid measure of cluster dispersion.

Intuitively, it seems to me that it should be possible to build a single hash for a set of proteins and then extract in one operation all of the pairwise kmer sharing counts and efficiently create a matrix of distances for all comparisons among all proteins in the set. This would scale very well to millions of proteins - much better than pairwise all-vs-all BLAST. Then the matrix can be used as input for a clustering algorithm.
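To make the idea concrete, here is a rough Python sketch of what I have in mind. It is not Mash's implementation, only an illustration under my own assumptions: the function names, the BLAKE2 hash, and the values of `k` and `s` are all placeholders. Each protein gets a bottom-s MinHash sketch, a single inverted index maps every sketch hash to the proteins that contain it, and only pairs that share at least one hash are scored with the Mash distance D = -(1/k) ln(2j / (1 + j)).

```python
import hashlib
import math
from collections import defaultdict
from itertools import combinations


def kmer_hashes(seq, k):
    """Return the set of 64-bit hashes of all k-mers in seq."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest k-mer hashes, sorted."""
    return sorted(kmer_hashes(seq, k))[:s]


def jaccard_estimate(a, b, s):
    """Estimate Jaccard similarity from the s smallest hashes of the union."""
    set_a, set_b = set(a), set(b)
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k):
    """Mash distance from a Jaccard estimate j, clamped to [0, 1]."""
    if j <= 0.0:
        return 1.0
    return max(0.0, min(1.0, -math.log(2.0 * j / (1.0 + j)) / k))


def all_vs_all(proteins, k=9, s=400):
    """Return {(name_a, name_b): distance} for pairs sharing a sketch hash.

    Pairs missing from the result share no sketch hash (distance 1.0).
    k and s are illustrative values, not recommendations.
    """
    names = list(proteins)
    sketches = {n: sketch(proteins[n], k, s) for n in names}

    # One inverted index for the whole set: hash -> proteins containing it.
    index = defaultdict(list)
    for n in names:
        for h in sketches[n]:
            index[h].append(n)

    # Candidate pairs are those that co-occur in at least one bucket.
    candidates = set()
    for members in index.values():
        candidates.update(combinations(members, 2))

    return {
        (a, b): mash_distance(jaccard_estimate(sketches[a], sketches[b], s), k)
        for a, b in candidates
    }


if __name__ == "__main__":
    demo = {
        "p1": "MKTAYIAKQRQISFVKSHFSRQ",
        "p2": "MKTAYIAKQRQISFVKSHFSRQ",
        "p3": "GGGGGGGGGGGGGGGGGGGGGG",
    }
    for pair, dist in all_vs_all(demo, k=5, s=50).items():
        print(pair, dist)
```

The result is a sparse matrix: unrelated pairs never get compared, which is where the savings over all-vs-all would come from. One caveat I can see is that a hash shared by very many proteins produces a large bucket, so the number of candidate pairs could still grow quadratically for highly redundant sets.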

Your thoughts on this would be very helpful.

stubrown, Nov 26 '24 19:11