HDBSCAN performance issue with large dataset
Hi Team,
We are currently running the HDBSCAN algorithm on a large and diverse dataset; the Python script is executed from within one of our products. Below are the script and the input data we are using:
from datetime import datetime
import pandas as pd
import modelerpy
modelerpy.installPackage('scikit-learn')
import sklearn
modelerpy.installPackage('cython')
modelerpy.installPackage('hdbscan')
import hdbscan
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, OneHotEncoder
import pkg_resources
from sklearn.decomposition import PCA
data = pd.read_csv("sample.csv")
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)
categorical_features = ['Gender', 'Marital Status']
numeric_features = list(set(cluster_data.columns) - set(categorical_features))
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough')
normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())
pca = PCA(n_components=2)
pca_result = pca.fit_transform(normalized_df)
print('build model start')
print(datetime.now().time())
try:
    model = hdbscan.HDBSCAN(
        min_cluster_size=1000,
        min_samples=5,
        metric="euclidean",
        alpha=1.0,
        p=1.5,
        algorithm="prims_kdtree",
        leaf_size=30,
        approx_min_span_tree=True,
        cluster_selection_method="eom",
        allow_single_cluster=False,
        gen_min_span_tree=True,
        prediction_data=True
    ).fit(pca_result)
    print('build model end')
    print(datetime.now().time())
    #print(model)
    print("Cluster labels:")
    print(model.labels_)
    print("\nNumber of clusters:")
    print(len(set(model.labels_)) - (1 if -1 in model.labels_ else 0))
    print("\nCluster membership probabilities:")
    print(model.probabilities_)
    print("\nOutlier scores:")
    print(model.outlier_scores_)
except Exception as e:
    # Code to handle any exception
    print(f"An error occurred: {e}")
Sample file: sample.csv
We have performed preprocessing steps including one-hot encoding, scaling, and dimensionality reduction (PCA). With algorithm="prims_kdtree", the script completes in approximately 8 minutes. However, switching the algorithm to "best", "boruvka_kdtree", or "boruvka_balltree" results in a failure within a few minutes with the error message:
"An error occurred: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by excessive memory usage causing the Operating System to kill the worker."
Note: When executing the script in a Jupyter notebook, we obtain results for the "best", "boruvka_kdtree", "boruvka_balltree", "prims_balltree", and "prims_kdtree" algorithms within a reasonable time.
Could you please help us with the following questions?
- Why do "best", "boruvka_kdtree", and "boruvka_balltree" algorithms fail while "prims_balltree" and "prims_kdtree" do not?
- What are the recommended best practices for optimizing HDBSCAN algorithm performance with large and varied datasets?
- Does HDBSCAN support spilling to disk?
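To make the second question concrete, here is a sketch of the subsampling workaround we have been considering. This is our own assumption rather than a documented best practice; it relies on prediction_data=True and hdbscan.approximate_predict, and the subsample size of 100,000 is arbitrary:

import numpy as np

# Sketch: fit HDBSCAN on a random subsample of the PCA output, then
# assign every row with approximate_predict (requires prediction_data=True).
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(pca_result), size=100_000, replace=False)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=1000,  # may need rescaling for the smaller sample
    min_samples=5,
    prediction_data=True,
).fit(pca_result[sample_idx])

# Returns a cluster label and a membership strength for each point
labels, strengths = hdbscan.approximate_predict(clusterer, pca_result)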
Your insights and guidance would be greatly appreciated.