scikit-learn-extra
scikit-learn-extra copied to clipboard
[NEW FEATURE] Adding min-max linkage to Agglomerative Hierarchical Clustering
I mentioned it on scikit-learn, and someone suggested mentioning it here on extra library as it seems a more appropriate place to add this new feature.
I think many people (who are on the application side of ML) use scikit-learn for their projects and are not aware of many algorithms that are available. Thus, implementing a new algorithm can let them know such a method exists. Please let me know if it is a good idea to add this method to the Agglomerative Clustering. If so, I can work on it (as I already implemented it myself) and add it to the package.
======================================
Challenge: The linkages in agglemorative clustering that can work with the distance matrix are complete, linkage, and average. However, the resulted clusters don't have a clear centroid. So, a method that can work with a distance matrix and provide a centroid-based structured clusters can provide a lot of flexibility. It can be used for any distance measure and since it utilizes the distance matrix only, it can be performed very fast for large-size data. It also provides centroids which can be used as the representatives of the clusters.
Solution: A min-max linkage proposed by Jacob Bien & Robert Tibshirani (2011) is a good solution. It merges two clusters if a ball that is required to cover the points of the two clusters together is the smallest compare to others. A ball for each cluster can be identified as follows: First, identify the furthest neighbor of each point and calculate its distance. Then, choose the point (M) that has the shortest distance to its furthest neighbor. That point M is the centroid and that shortest distance is the radius of the ball. I recently used it for my paper where I had to use DTW distance, but I also need a fast algorithm that provides a reasonable centroid rather than computing DBA at each step of the process.
Alternative solution: N/A
Additional Info: Paper: Bien, Jacob, and Robert Tibshirani. "Hierarchical clustering with prototypes via minimax linkage." Journal of the American Statistical Association 106.495 (2011): 1075-1084.