
HDBSCAN performance issue with large dataset

divya-agrawal3103 opened this issue on Jul 12, 2024 · 3 comments

Hi Team,

We are currently running the HDBSCAN algorithm on a large and diverse dataset, using one of our products to execute a Python script. Below is the script we are using, along with the input data:

from datetime import datetime
import pandas as pd
import modelerpy

# Install the required packages into the product's Python environment
modelerpy.installPackage('scikit-learn')
import sklearn
modelerpy.installPackage('cython')
modelerpy.installPackage('hdbscan')
import hdbscan

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.decomposition import PCA
import pkg_resources

# Load the data and drop identifier/response columns before clustering
data = pd.read_csv("sample.csv")
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)

# One-hot encode the categorical features and robust-scale the numeric ones
categorical_features = ['Gender', 'Marital Status']
numeric_features = list(set(cluster_data.columns) - set(categorical_features))
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough')
normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())

# Reduce the preprocessed data to two dimensions
pca = PCA(n_components=2)
pca_result = pca.fit_transform(normalized_df)
print('build model start')
print(datetime.now().time())
try:
    # Fit HDBSCAN on the PCA-reduced data
    model = hdbscan.HDBSCAN(
        min_cluster_size=1000,
        min_samples=5,
        metric="euclidean",
        alpha=1.0,
        p=1.5,
        algorithm="prims_kdtree",
        leaf_size=30,
        approx_min_span_tree=True,
        cluster_selection_method="eom",
        allow_single_cluster=False,
        gen_min_span_tree=True,
        prediction_data=True
    ).fit(pca_result)
    print('build model end')
    print(datetime.now().time())
    #print(model)
    print("Cluster labels:")
    print(model.labels_)
    print("\nNumber of clusters:")
    print(len(set(model.labels_)) - (1 if -1 in model.labels_ else 0))
    print("\nCluster membership probabilities:")
    print(model.probabilities_)
    print("\nOutlier scores:")
    print(model.outlier_scores_)
except Exception as e:
    # Report any failure during model building
    print(f"An error occurred: {e}")

Sample file: sample.csv

We have performed preprocessing steps including one-hot encoding, scaling, and dimensionality reduction. With algorithm="prims_kdtree", the script executes in approximately 8 minutes. However, switching the algorithm to "best", "boruvka_kdtree", or "boruvka_balltree" results in a failure within a few minutes with the following error message:

"An error occurred: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by excessive memory usage causing the Operating System to kill the worker."

Note: When executing the script using Jupyter Notebook, we obtain results for "best", "boruvka_kdtree", "boruvka_balltree", "prims_balltree", and "prims_kdtree" algorithms within a reasonable time.
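
For clarity, a minimal sketch of the failing configuration is shown below; it is identical to the script above except for the algorithm argument (all other parameters and the input data are unchanged):

# Minimal sketch of the failing run: same pca_result and parameters as above,
# only the `algorithm` argument differs. "prims_kdtree" (and "prims_balltree")
# complete, while "best", "boruvka_kdtree", and "boruvka_balltree" terminate
# with the worker-process error quoted above.
model = hdbscan.HDBSCAN(
    min_cluster_size=1000,
    min_samples=5,
    metric="euclidean",
    algorithm="boruvka_kdtree",  # also fails with "best" and "boruvka_balltree"
    leaf_size=30,
    approx_min_span_tree=True,
    cluster_selection_method="eom",
    gen_min_span_tree=True,
    prediction_data=True
).fit(pca_result)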

Could you please help us with the following questions?

  1. Why do "best", "boruvka_kdtree", and "boruvka_balltree" algorithms fail while "prims_balltree" and "prims_kdtree" do not?
  2. What are the recommended best practices for optimizing HDBSCAN algorithm performance with large and varied datasets?
  3. Does HDBSCAN support spilling to disk?

Your insights and guidance would be greatly appreciated.

divya-agrawal3103 · Jul 12 '24 07:07