scikit-learn-extra
Notebook for Min-Max linkage #93
Create Tutorial Notebook for Min-Max linkage and provide its naive implementation, corresponding to #93
Hello,
This is my first time contributing to scikit-learn-extra, and it was hard for me to understand the source code. Therefore, as a first step, I created a notebook and provided a naive implementation. I believe this can help us with the unit tests.
I was wondering if someone could guide me through the source code and help me implement this for the scikit-learn-extra package.
Hello,
Thank you, this looks interesting.
I think, given the subject, you can try to understand the outline of the KMedoids code and do the same; the idea is to stay in line with scikit-learn code.
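The scikit-learn convention referred to here is that `__init__` only stores its arguments unchanged, while `fit` does the work and sets fitted attributes ending with an underscore (`labels_`, `n_iter_`). A minimal toy illustration of that layout (the class `TwoBucketClustering` and its `threshold` parameter are invented for this sketch, not anything in scikit-learn or scikit-learn-extra):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin


class TwoBucketClustering(BaseEstimator, ClusterMixin):
    """Toy estimator showing the scikit-learn layout: __init__ only
    stores parameters; fit computes labels_ and n_iter_."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold  # stored as-is, no validation here

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Fitted attributes end with an underscore.
        self.labels_ = (X[:, 0] > self.threshold).astype(int)
        self.n_iter_ = 1
        return self


est = TwoBucketClustering(threshold=0.5).fit([[0.0], [1.0]])
print(est.labels_)       # [0 1]
print(est.get_params())  # {'threshold': 0.5}, inherited from BaseEstimator
```

Because parameters are stored untouched, `get_params`/`set_params` (and thus cloning and grid search) work for free via `BaseEstimator`.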
To help you, here is the skeleton I would use if I were to write the code for your notebook in a file sklearn_extra/cluster/agglomerative_clustering_minmax.py, for instance:
```python
from sklearn.base import BaseEstimator, ClusterMixin, TransformerMixin


class AgglomerativeClusteringMinMax(BaseEstimator, ClusterMixin, TransformerMixin):
    """
    # Add a short description here

    Parameters
    ----------
    n_clusters : int, default=2
        The number of clusters to find.

    metric : "precomputed" or "euclidean", default="euclidean"
        Metric used to compute the distance between any two samples.

    max_iter : int, default=300
        Maximum number of iterations.

    Attributes
    ----------
    # Add the attributes here. For ClusterMixin you must have at least
    # labels_ and n_iter_. I don't think there are any centers or inertia
    # in your definition of clustering, so at first just labels_ and
    # n_iter_ is alright.

    Examples
    --------
    # Add a short example of use

    References
    ----------
    # Add the article
    """

    def __init__(
        self,
        n_clusters=8,
        metric="euclidean",
        max_iter=300,
    ):
        self.n_clusters = n_clusters
        # Instead of affinity I would use metric, to be homogeneous with KMedoids.
        self.metric = metric
        # max_iter is a bound on the number of iterations, so that a bug
        # cannot cause an infinite loop.
        self.max_iter = max_iter

    def fit(self, X, y=None):
        """Fit agglomerative min-max clustering to the provided data.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features), \
                or (n_samples, n_samples) if metric == 'precomputed'
            Dataset to cluster.

        y : Ignored

        Returns
        -------
        self
        """
        # Put your code here to compute labels_ and n_iter_.
        # labels_ is what you called "clusters", i.e. the output of your
        # algorithm, and n_iter_ is the number of iterations used.
        return self

    def transform(self, X):
        """Transform X to cluster-distance space.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_query, n_features), \
                or (n_query, n_indexed) if metric == 'precomputed'
            Data to transform.

        Returns
        -------
        X_new : {array-like, sparse matrix}, shape (n_query, n_clusters)
        """
        # I am not certain that it is alright to do it like this, but here
        # is what I would do: output the result of the r function for each
        # point, i.e. min_{x in C} d_max(x, C).
        return r(X)

    def predict(self, X):
        """Predict the closest cluster for each sample in X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_query, n_features), \
                or (n_query, n_indexed) if metric == 'precomputed'
            New data to predict.

        Returns
        -------
        labels : array, shape (n_query,)
            Index of the cluster each sample belongs to.
        """
        # Add the cluster computation here; it can be used on new points,
        # not only on the training data.
        return clusters
```
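To make the discussion (and the future unit tests) concrete, here is a self-contained sketch of a naive min-max linkage clustering: start from singleton clusters and greedily merge the pair whose union has the smallest minimax radius min_{x in C} max_{y in C} d(x, y). The function name `minimax_linkage_cluster` and the brute-force O(n^3)-per-merge approach are illustrative assumptions, not part of any scikit-learn-extra API:

```python
import numpy as np
from scipy.spatial.distance import cdist


def minimax_linkage_cluster(X, n_clusters=2):
    """Naive agglomerative clustering with min-max linkage (sketch).

    Repeatedly merges the two clusters whose union has the smallest
    minimax radius min_{x in C} max_{y in C} d(x, y), until only
    n_clusters clusters remain.
    """
    D = cdist(X, X)  # full pairwise distance matrix
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                members = clusters[a] + clusters[b]
                # Minimax radius of the candidate merged cluster:
                # for each point, its max distance to the cluster,
                # then the min of those maxima.
                radius = D[np.ix_(members, members)].max(axis=1).min()
                if radius < best[0]:
                    best = (radius, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a is unaffected
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

The per-point quantity `D[np.ix_(members, members)].max(axis=1)` is the d_max(x, C) mentioned in the transform comment above, so a transform could reuse the same computation.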
It would also be good to add some helper functions to the class, for example a method that updates the distance matrix.
Once the code is done, you would also have to:
- Add the name of the algorithm in sklearn_extra/cluster/__init__.py.
- Write some documentation for the code (in the file doc/modules/cluster.rst).
- Add your code to the API doc (in the file doc/api.rst).
- Write some tests in sklearn_extra/cluster/tests.
- Make some examples (the naive example would be good, and in addition an example showing what is particular about this algorithm compared to other clustering algorithms).
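A typical scikit-learn-style test fits the estimator on well-separated blobs and checks the recovered labels. A sketch of that pattern, using scikit-learn's existing AgglomerativeClustering purely as a stand-in until the min-max estimator exists:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score


def test_two_well_separated_blobs():
    rng = np.random.RandomState(0)
    # Two Gaussian blobs separated far beyond their spread.
    X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 10])
    true_labels = np.repeat([0, 1], 20)
    # Swap in the new min-max estimator here once it is implemented.
    labels = AgglomerativeClustering(n_clusters=2).fit(X).labels_
    # Perfect recovery up to label permutation.
    assert adjusted_rand_score(true_labels, labels) == 1.0
```

Using adjusted_rand_score keeps the test insensitive to which integer each cluster happens to receive.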
Don't hesitate to ask if you have any questions. I wrote part of the KMedoids code, so I understand it pretty well.
EDIT: most of the code is a suggestion, not an official guideline; feel free to improve the basic ideas I gave.
@TimotheeMathieu Thanks a lot for your input. You are right; I might also take a look at AgglomerativeClustering in scikit-learn to see how the clusters are stored at each cut.
I'll start from your guideline. Thanks!!