scikit-learn-extra
Notebook for Min-Max linkage #93
Create Tutorial Notebook for Min-Max linkage and provide its naive implementation, corresponding to #93
Hello,
This is my first time contributing to scikit-learn-extra, and it was hard for me to understand the source code. Therefore, as a first step, I created a notebook and provided a naive implementation. I believe this can help us with the unit tests.
I was wondering if someone could guide me through the source code and help me implement this for the scikit-learn-extra package.
Hello,
Thank you, this looks interesting.
I think, given the subject, you can try to understand the outline of the KMedoids code and do the same; the idea is to stay in line with scikit-learn code.
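The scikit-learn convention referred to here is that `__init__` only stores its arguments unchanged, while `fit` does the work and sets fitted attributes ending with an underscore (`labels_`, `n_iter_`). A minimal toy illustration of that layout (the class `TwoBucketClustering` and its `threshold` parameter are invented for this sketch, not anything in scikit-learn or scikit-learn-extra):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin


class TwoBucketClustering(BaseEstimator, ClusterMixin):
    """Toy estimator showing the scikit-learn layout: __init__ only
    stores parameters; fit computes labels_ and n_iter_."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold  # stored as-is, no validation here

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Fitted attributes end with an underscore.
        self.labels_ = (X[:, 0] > self.threshold).astype(int)
        self.n_iter_ = 1
        return self


est = TwoBucketClustering(threshold=0.5).fit([[0.0], [1.0]])
print(est.labels_)       # [0 1]
print(est.get_params())  # {'threshold': 0.5}, inherited from BaseEstimator
```

Because parameters are stored untouched, `get_params`/`set_params` (and thus cloning and grid search) work for free via `BaseEstimator`.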
To help you, here is the skeleton I would use if I were to write the code for your notebook in a file sklearn_extra/cluster/agglomerative_clustering_minmax.py, for instance:
```python
from sklearn.base import BaseEstimator, ClusterMixin, TransformerMixin


class AgglomerativeClusteringMinMax(BaseEstimator, ClusterMixin, TransformerMixin):
    """
    # Add a short description here

    Parameters
    ----------
    n_clusters : int, default=2
        The number of clusters to find.

    metric : "precomputed" or "euclidean", default="euclidean"
        Metric used to compute the distance between any two samples.

    max_iter : int, default=300
        Maximum number of iterations.

    Attributes
    ----------
    # Add the attributes here. For ClusterMixin you must have at least
    # labels_ and n_iter_. I don't think there are any centers or inertia
    # in your definition of clustering, so at first just labels_ and
    # n_iter_ is alright.

    Examples
    --------
    # Add a short example of use

    References
    ----------
    # Add the article
    """

    def __init__(
        self,
        n_clusters=8,
        metric="euclidean",
        max_iter=300,
    ):
        self.n_clusters = n_clusters
        # Instead of affinity I would use metric, to be homogeneous with KMedoids.
        self.metric = metric
        # max_iter is a bound on the number of iterations, so that a bug
        # cannot cause an infinite loop.
        self.max_iter = max_iter

    def fit(self, X, y=None):
        """Fit agglomerative min-max clustering to the provided data.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features), \
                or (n_samples, n_samples) if metric == 'precomputed'
            Dataset to cluster.

        y : Ignored

        Returns
        -------
        self
        """
        # Put your code here to compute labels_ and n_iter_.
        # labels_ is what you called "clusters", i.e. the output of your
        # algorithm, and n_iter_ is the number of iterations used.
        return self

    def transform(self, X):
        """Transform X to cluster-distance space.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_query, n_features), \
                or (n_query, n_indexed) if metric == 'precomputed'
            Data to transform.

        Returns
        -------
        X_new : {array-like, sparse matrix}, shape (n_query, n_clusters)
        """
        # I am not certain that it is alright to do it like this, but here
        # is what I would do: output the result of the r function for each
        # point, i.e. min_{x in C} d_max(x, C).
        return r(X)

    def predict(self, X):
        """Predict the closest cluster for each sample in X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_query, n_features), \
                or (n_query, n_indexed) if metric == 'precomputed'
            New data to predict.

        Returns
        -------
        labels : array, shape (n_query,)
            Index of the cluster each sample belongs to.
        """
        # Add the cluster computation here; it can be used on new points,
        # not only on the training data.
        return clusters
```
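To make the discussion (and the future unit tests) concrete, here is a self-contained sketch of a naive min-max linkage clustering: start from singleton clusters and greedily merge the pair whose union has the smallest minimax radius min_{x in C} max_{y in C} d(x, y). The function name `minimax_linkage_cluster` and the brute-force O(n^3)-per-merge approach are illustrative assumptions, not part of any scikit-learn-extra API:

```python
import numpy as np
from scipy.spatial.distance import cdist


def minimax_linkage_cluster(X, n_clusters=2):
    """Naive agglomerative clustering with min-max linkage (sketch).

    Repeatedly merges the two clusters whose union has the smallest
    minimax radius min_{x in C} max_{y in C} d(x, y), until only
    n_clusters clusters remain.
    """
    D = cdist(X, X)  # full pairwise distance matrix
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                members = clusters[a] + clusters[b]
                # Minimax radius of the candidate merged cluster:
                # for each point, its max distance to the cluster,
                # then the min of those maxima.
                radius = D[np.ix_(members, members)].max(axis=1).min()
                if radius < best[0]:
                    best = (radius, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a is unaffected
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

The per-point quantity `D[np.ix_(members, members)].max(axis=1)` is the d_max(x, C) mentioned in the transform comment above, so a transform could reuse the same computation.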
It would also be good to add some helper functions to the class, for example a method that updates the distance matrix.
Once the code is done, you would also have to:
- Add the name of the algorithm in sklearn_extra/cluster/__init__.py.
- Write some documentation for the code (in the file doc/modules/cluster.rst).
- Add your code to the API doc (in the file doc/api.rst).
- Write some tests in sklearn_extra/cluster/tests.
- Make some examples (the naive example would be good, and in addition an example showing what is particular about this algorithm compared to other clustering algorithms).
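A typical scikit-learn-style test fits the estimator on well-separated blobs and checks the recovered labels. A sketch of that pattern, using scikit-learn's existing AgglomerativeClustering purely as a stand-in until the min-max estimator exists:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score


def test_two_well_separated_blobs():
    rng = np.random.RandomState(0)
    # Two Gaussian blobs separated far beyond their spread.
    X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 10])
    true_labels = np.repeat([0, 1], 20)
    # Swap in the new min-max estimator here once it is implemented.
    labels = AgglomerativeClustering(n_clusters=2).fit(X).labels_
    # Perfect recovery up to label permutation.
    assert adjusted_rand_score(true_labels, labels) == 1.0
```

Using adjusted_rand_score keeps the test insensitive to which integer each cluster happens to receive.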
Don't hesitate to ask if you have any questions. I wrote part of the KMedoids code, so I understand it pretty well.
EDIT: most of the code is a suggestion, not an official guideline; feel free to improve the basic ideas I gave.
@TimotheeMathieu Thanks a lot for your input. You are right; I might also take a look at AgglomerativeClustering in scikit-learn to see how the clusters are stored at each cut.
I'll start from your guideline. Thanks!!