k-means-constrained
k-means-constrained copied to clipboard
Allow top-level customisation of the distance metric used for clustering
The current version of k-means-constrained uses the Euclidean distance to calculate the distance between points in the dataset. Issue #17 highlighted the potential to make this a customisable parameter, and this pull requests introduces the functionality following the advice suggested in that issue.
What's new?
euclidean_distancesis no longer hard-coded into the clustering calculations- Introduce a
distance_metricargument which can be passed to the top-level clustering class and propagates through to the optimisation routine - Defaults to
euclidean_distancefunction but enables users to change this to any other function of their choosing
Hi @bibblybobblyben, thanks for your interest and contribution! 🙂
You will also need to change how the centroids are determined. Currently, a mean of all the data points in a cluster is calculated to get the centroid, but this only makes sense if you are considering Euclidean distances.
The centroids are initially calculated here: https://github.com/joshlk/k-means-constrained/blob/65e22120d37e7c878d91d7fc73cb6380d581ec4f/k_means_constrained/k_means_constrained_.py#L303
And then here: https://github.com/joshlk/k-means-constrained/blob/65e22120d37e7c878d91d7fc73cb6380d581ec4f/k_means_constrained/k_means_constrained_.py#L340
Please also add some tests.
Looking forward to merging this feature 🤩 !
Josh
Thanks for the comments and feedback - will work to implement them 🙂
Good point on the centroids - do you think the best course of action is editing the _k_init function from k_means_constrained.sklearn_import.cluster.k_means_ so that it can also take a custom distance metric? Other keyword arguments in init_centroids don't seem affected by the change of distance metric?
For the changes involving _centers_dense and _centers_sparse, do I edit the functions in _k_means.pyx ? I am not familiar with using cython.
Can we discuss the motivation to edit those functions please? As I understand it - it looks at all samples that have been pre-allocated to a cluster number, and then calculates the mean coordinate for each point within that cluster. As they are already allocated by the _labels_constrained function, which will incorporate the distance metric, is this calculation already implicitly using the metric? Or do you think we need to do something smarter to define the centre of each cluster? Something to minimise the mean distance to every point in that cluster?