Return exemplar indices in data instead of actual points.

Open hammadmazhar1 opened this issue 6 years ago • 3 comments

Hi, I'm using HDBSCAN to cluster network traffic data. I see that there is a way to retrieve the exemplar points representing the clusters. Would it be possible to instead return the indices of these points in the provided data, so that it is easier to correlate them with the raw data? (I do some transformations to create numerical representations of hostnames.) If there is already a method to do so, please let me know.

hammadmazhar1 avatar Jun 05 '19 14:06 hammadmazhar1

It is certainly possible -- I don't have the exact approach to hand right now, but if you look at the code in the prediction.py file you'll see how the exemplar extraction is done, and it should be relatively straightforward to adapt that.

lmcinnes avatar Jun 07 '19 14:06 lmcinnes

@hammadmazhar1

Not sure if you figured this out or not, but here's a code snippet that works for me. In the following, `clusterer` is a fitted hdbscan clusterer object. The output is a list with one array per selected cluster; each array holds the indexes (starting at 0) of that cluster's exemplars.

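```python
import numpy as np

# Clusters picked by HDBSCAN's cluster selection, plus the raw condensed tree
# (a structured array with 'parent', 'child' and 'lambda_val' fields).
selected_clusters = clusterer.condensed_tree_._select_clusters()
raw_condensed_tree = clusterer.condensed_tree_._raw_tree

exemplars = []
for cluster in selected_clusters:
    cluster_exemplars = np.array([], dtype=np.int64)
    # Visit every leaf of the condensed tree beneath this cluster.
    for leaf in clusterer._prediction_data._recurse_leaf_dfs(cluster):
        # Largest lambda value among the children of this leaf.
        leaf_max_lambda = raw_condensed_tree['lambda_val'][
            raw_condensed_tree['parent'] == leaf].max()
        # Children of the leaf that reach that lambda are its exemplars.
        points = raw_condensed_tree['child'][
            (raw_condensed_tree['parent'] == leaf) &
            (raw_condensed_tree['lambda_val'] == leaf_max_lambda)]
        cluster_exemplars = np.hstack([cluster_exemplars, points])
    exemplars.append(cluster_exemplars)
```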
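
Since the original question was about correlating exemplars with the raw data, here is a minimal sketch of that last step. It assumes `exemplars` has the shape produced above (one index array per selected cluster) and that the raw records are kept in the same row order as the array passed to `fit`. The `hostnames` array and the index values are made up for illustration.

```python
import numpy as np

# Stand-in for the `exemplars` list built above: one index array per cluster.
exemplars = [np.array([0, 3], dtype=np.int64), np.array([4], dtype=np.int64)]

# Hypothetical raw records, in the same row order as the data given to fit().
hostnames = np.array([
    "a.example.com", "b.example.com", "c.example.org",
    "d.example.com", "e.example.net",
])

# Look up the raw record behind each exemplar index, keyed by the cluster's
# position in `exemplars`.
exemplar_hosts = {
    position: hostnames[indices].tolist()
    for position, indices in enumerate(exemplars)
}
print(exemplar_hosts)
```

The keys here are just positions in the `exemplars` list. To confirm which cluster label a position corresponds to, it is probably safest to check `clusterer.labels_[indices]` for each array rather than rely on the ordering.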

jsgroob avatar Aug 01 '19 19:08 jsgroob

For anyone who hits this error: `AttributeError: 'NoneType' object has no attribute '_recurse_leaf_dfs'`, it means `clusterer._prediction_data` has not been generated yet. Try adding the code below at the beginning of @jsgroob's code. It works for me :).

```python
if clusterer._prediction_data is None:
    clusterer.generate_prediction_data()
```

Humbertzhang avatar Sep 15 '22 07:09 Humbertzhang