Dendrogram

Open JChristopherEllis opened this issue 3 years ago • 2 comments

Can you create a dendrogram from the dist results?

Also, could you recommend parameters for large fungal genome comparison?

JChristopherEllis, Apr 16 '21 20:04

Hi,

Sure, you can do that.

You'd start with a distance or similarity matrix, and then feed that into a hierarchical clustering algorithm. Good options could include scipy's hierarchical clustering (https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html) or HDBSCAN, both of which can work on distance matrices.

For parameter selection, the best k will depend on how similar the genomes are. A k of 16-19 seems to work well for generating pairwise distances across all fungal genomes in RefSeq, but if you're working with many related strains you may want something more like 30-100.

Here is an example workflow with SciPy's hierarchical clustering that you might follow:

import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

x = ... # Parse distance matrix from file somehow
# If square, convert to the condensed distance matrix that scipy.cluster.hierarchy expects
if x.ndim > 1:
    from scipy.spatial.distance import squareform
    x = squareform(x)

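L = sch.linkage(x)  # single linkage by default; pass method="average" etc. to change
dn = sch.dendrogram(L)  # draws onto the current matplotlib axes
plt.show()  # display the figure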

You can then export the dendrogram or visualize it with matplotlib. (Calling plt.show() after creating the dendrogram should display it.)
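As a self-contained sketch of the steps above: the matrix and labels below are made-up placeholder values rather than real dashing output, and average linkage (UPGMA) is just one reasonable choice, not a recommendation specific to this tool.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Hypothetical 4-genome symmetric distance matrix (placeholder values).
labels = ["A", "B", "C", "D"]
square = np.array([
    [0.00, 0.10, 0.80, 0.85],
    [0.10, 0.00, 0.82, 0.90],
    [0.80, 0.82, 0.00, 0.20],
    [0.85, 0.90, 0.20, 0.00],
])

# linkage() expects the condensed (upper-triangle) form;
# checks=True validates symmetry and a zero diagonal.
condensed = squareform(square, checks=True)

# Average linkage (UPGMA); SciPy's default would be single linkage.
Z = sch.linkage(condensed, method="average")

# no_plot=True returns the layout without drawing anything;
# drop it and call plt.show() to render the figure instead.
dn = sch.dendrogram(Z, labels=labels, no_plot=True)
leaf_order = dn["ivl"]  # leaf labels in left-to-right order

# Optionally cut the tree into two flat clusters.
clusters = sch.fcluster(Z, t=2, criterion="maxclust")
```

Here `leaf_order` gives the left-to-right ordering of the leaves, and `clusters` assigns each genome to one of two groups, which can be handy when the full dendrogram is too large to read.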

The downside to this is that SciPy only works with symmetric distances, though you should be able to use containment distance with HDBSCAN. Of course, you can convert any similarity measure (containment, Jaccard) into a distance by taking 1 - x, where x is the similarity, or you can use the Mash formula to convert a Jaccard index into a distance: -log((2 * x) / (1 + x)) / k, where log is the natural logarithm.
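The two conversions can be sketched as small helper functions. The function names are hypothetical, and mapping a Jaccard index of 0 to a distance of 1.0 is an assumption made here, because the logarithm is undefined at that point.

```python
import math


def similarity_to_distance(sim):
    """Simple conversion: distance = 1 - similarity, for sim in [0, 1]."""
    if not 0.0 <= sim <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    return 1.0 - sim


def mash_distance(jaccard, k):
    """Mash-style distance: -ln(2j / (1 + j)) / k.

    Assumption: j == 0 is mapped to 1.0, since the log is undefined there.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard index must be in [0, 1]")
    if jaccard == 0.0:
        return 1.0
    # max() avoids returning -0.0 when jaccard == 1.
    return max(0.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)
```

Either function can be applied elementwise to a similarity matrix before it is passed to a clustering routine that expects distances.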

Spectral Clustering, for instance, will use affinities rather than distances.

I hope this helps, and let me know if you have any further questions or problems. Thanks,

Daniel

dnbaker, Apr 20 '21 00:04

Quicktree also performs quite well

sed -i "1s/.*/$FILECOUNT/" "$dashingDistanceMatrix" # overwrite line 1 with $FILECOUNT, presumably the number of genomes (PHYLIP header)
quicktree -in m "$dashingDistanceMatrix" > "$newick" # NJ-tree, https://github.com/khowe/quicktree
nw_reroot "$newick" > final.nwk # quick and dirty rooting, http://cegg.unige.ch/newick_utils
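If the distances are already in memory (for example as a NumPy array), a PHYLIP-style matrix file can also be written directly. This is only a sketch: `write_phylip_matrix` is a hypothetical helper, and it assumes the reader accepts whitespace-separated fields and names without spaces, so check the output against what quicktree accepts before relying on it.

```python
import numpy as np


def write_phylip_matrix(path, names, square):
    """Write a square distance matrix in PHYLIP-style format.

    Line 1 holds the number of taxa; each following line holds a name
    followed by that row's distances. Names must not contain whitespace.
    """
    square = np.asarray(square, dtype=float)
    n = len(names)
    if square.shape != (n, n):
        raise ValueError("matrix shape must match the number of names")
    if any(len(name.split()) != 1 for name in names):
        raise ValueError("names must be non-empty and contain no whitespace")
    with open(path, "w") as fh:
        fh.write(f"{n}\n")  # header: number of taxa
        for name, row in zip(names, square):
            fields = "\t".join(f"{d:.6f}" for d in row)
            fh.write(f"{name}\t{fields}\n")
```

The resulting file would then take the place of `$dashingDistanceMatrix` in the commands above, with no need for the `sed` header fix.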

mihkelvaher, Apr 23 '21 10:04