hdbscan Return exemplar indices in data instead of actual points.

Hi, I am using HDBSCAN to cluster network traffic data. I see that there is a way to retrieve the exemplar points representing the clusters. Would it be possible to instead return the indices of these points in the provided data, so that it is easier to correlate them with the raw data? (I do some transformations to create numerical representations of hostnames.) If there is already a method to do so, please let me know.

Jun 05 '19 14:06 hammadmazhar1

It is certainly possible -- I don't have the exact approach to hand right now, but if you look at the code in the prediction.py file you'll see how the exemplar extraction is done, and it should be relatively straightforward to adapt that.

Jun 07 '19 14:06 lmcinnes

@hammadmazhar1

Not sure if you figured this out or not, but here's a code snippet that works for me. In the following, clusterer is a fitted hdbscan clusterer object. The output is a list with one array per selected cluster, each holding the indexes (starting at 0) of that cluster's exemplars in the input data.

```python
import numpy as np

# Node ids of the clusters HDBSCAN selected, and the condensed tree as a structured array
selected_clusters = clusterer.condensed_tree_._select_clusters()
raw_condensed_tree = clusterer.condensed_tree_._raw_tree

exemplars = []
for cluster in selected_clusters:
    cluster_exemplars = np.array([], dtype=np.int64)
    # Visit every leaf cluster beneath the selected cluster
    for leaf in clusterer._prediction_data._recurse_leaf_dfs(cluster):
        # Exemplars are the points with the largest lambda value in the leaf
        leaf_max_lambda = raw_condensed_tree['lambda_val'][
            raw_condensed_tree['parent'] == leaf].max()
        points = raw_condensed_tree['child'][
            (raw_condensed_tree['parent'] == leaf) &
            (raw_condensed_tree['lambda_val'] == leaf_max_lambda)]
        cluster_exemplars = np.hstack([cluster_exemplars, points])
    exemplars.append(cluster_exemplars)
```
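
To see what the max-lambda selection in the inner loop does, here is a minimal sketch on a hand-built stand-in for the condensed tree. The array toy_tree, its values, and the helper leaf_exemplars are made up for illustration; only the field names mirror the ones used in the snippet above. It needs nothing but numpy, so it runs without a fitted clusterer.

```python
import numpy as np

# Hypothetical condensed tree: points are numbered 0..4, ids >= 5 are cluster nodes.
toy_tree = np.array(
    [
        (5, 6, 0.5, 3),  # root 5 splits into leaf cluster 6 ...
        (5, 7, 0.5, 2),  # ... and leaf cluster 7
        (6, 0, 1.0, 1),  # point 0 falls out of leaf 6 at lambda 1.0
        (6, 1, 2.0, 1),  # points 1 and 2 stay until lambda 2.0
        (6, 2, 2.0, 1),
        (7, 3, 0.8, 1),
        (7, 4, 1.5, 1),
    ],
    dtype=[("parent", np.intp), ("child", np.intp),
           ("lambda_val", np.float64), ("child_size", np.intp)],
)


def leaf_exemplars(tree, leaf):
    """Return the ids of the points that reach the largest lambda value in a leaf."""
    in_leaf = tree["parent"] == leaf
    max_lambda = tree["lambda_val"][in_leaf].max()
    return tree["child"][in_leaf & (tree["lambda_val"] == max_lambda)]


exemplars_6 = leaf_exemplars(toy_tree, 6)
exemplars_7 = leaf_exemplars(toy_tree, 7)
```

In this toy tree, leaf 6 yields points 1 and 2 (tied at the largest lambda) and leaf 7 yields point 4. Because the values are point ids rather than coordinates, they can be used directly as row indexes into the data that was clustered.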

Aug 01 '19 19:08 jsgroob

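For anyone who hits this error: AttributeError: 'NoneType' object has no attribute '_recurse_leaf_dfs', it means the clusterer has no prediction data yet, which is the case when it was fitted without prediction_data=True. You can try adding the code below at the beginning of @jsgroob's code; it works for me :).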

if clusterer._prediction_data is None:
    clusterer.generate_prediction_data()
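
Once the index arrays are available, correlating them with the untransformed records is plain indexing. The sketch below assumes the raw records are kept in the same row order as the matrix that was passed to the clusterer; raw_hostnames and the contents of exemplars are hypothetical stand-ins, not output from a real run.

```python
import numpy as np

# Hypothetical raw records, in the same row order as the clustered feature matrix.
raw_hostnames = np.array([
    "a.example.com",
    "b.example.com",
    "c.example.org",
    "d.example.org",
    "e.example.net",
])

# Stand-in for the `exemplars` list built by the snippet: one index array per cluster.
exemplars = [np.array([1, 2]), np.array([4])]

# Look up the raw record behind each exemplar index, cluster by cluster.
exemplar_hostnames = [raw_hostnames[idx].tolist() for idx in exemplars]
```

If rows were dropped or reordered before clustering, the indexes refer to the transformed matrix, so the same filtering would have to be applied to the raw records first.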

Sep 15 '22 07:09 Humbertzhang