scikit-learn-extra icon indicating copy to clipboard operation
scikit-learn-extra copied to clipboard

[NEW FEATURE] Adding min-max linkage to Agglomerative Hierarchical Clustering

Open NimaSarajpoor opened this issue 3 years ago • 0 comments

I mentioned it on scikit-learn, and someone suggested mentioning it here on extra library as it seems a more appropriate place to add this new feature.

I think many people (who are on the application side of ML) use scikit-learn for their projects and are not aware of many algorithms that are available. Thus, implementing a new algorithm can let them know such a method exists. Please let me know if it is a good idea to add this method to the Agglomerative Clustering. If so, I can work on it (as I already implemented it myself) and add it to the package.

======================================

Challenge: The linkages in agglemorative clustering that can work with the distance matrix are complete, linkage, and average. However, the resulted clusters don't have a clear centroid. So, a method that can work with a distance matrix and provide a centroid-based structured clusters can provide a lot of flexibility. It can be used for any distance measure and since it utilizes the distance matrix only, it can be performed very fast for large-size data. It also provides centroids which can be used as the representatives of the clusters.

Solution: A min-max linkage proposed by Jacob Bien & Robert Tibshirani (2011) is a good solution. It merges two clusters if a ball that is required to cover the points of the two clusters together is the smallest compare to others. A ball for each cluster can be identified as follows: First, identify the furthest neighbor of each point and calculate its distance. Then, choose the point (M) that has the shortest distance to its furthest neighbor. That point M is the centroid and that shortest distance is the radius of the ball. I recently used it for my paper where I had to use DTW distance, but I also need a fast algorithm that provides a reasonable centroid rather than computing DBA at each step of the process.

Alternative solution: N/A

Additional Info: Paper: Bien, Jacob, and Robert Tibshirani. "Hierarchical clustering with prototypes via minimax linkage." Journal of the American Statistical Association 106.495 (2011): 1075-1084.

NimaSarajpoor avatar Mar 21 '21 17:03 NimaSarajpoor