pyclustering
Sampling Size
Hi @annoviko, does the CURE algorithm implementation sample the dataset irrespective of the dataset size? If yes, what is the sample size? I tried various parameters, including different numbers of clusters, but I always get one main cluster and a single point in each of the other clusters. I guess this is possible only if all the data points are sampled. My dataset dimensions are 26000 x 100.
Hello @devshah96, clustering results depend on the input parameters. I don't know what your data looks like, but for complex data it is not a trivial task to find proper parameters for the algorithm.
@annoviko Thank you for the quick reply. The CURE algorithm actually subsamples the dataset instead of using the entire dataset. My question is whether the current implementation also subsamples the dataset before clustering or uses the entire dataset.
@devshah96, each point is considered a separate cluster at the beginning, and the clusters are merged step by step. The algorithm does not provide the random sampling and partitioning feature, if that is what you are talking about. But that feature helps to avoid the requirement to load all the data into RAM, and it does not affect the clustering process, as I understand it. Correct me if I am wrong.
@annoviko Yes, I was talking about the random sampling and partitioning feature. Don't you think this step is integral to the removal of outliers? Also, does the hierarchical clustering algorithm remove clusters of very low density when merging clusters, as described in the paper?
@devshah96, that looks like a good argument for supporting this feature.
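For readers who want the sampling step discussed in this thread before it is supported by the library, here is a minimal sketch of doing it outside the clusterer. It only illustrates the idea: draw a uniform random sample, cluster the sample with whatever clusterer you use, then label the remaining points by their nearest representative point. The helper names `draw_sample` and `label_by_nearest_representative` are hypothetical and are not part of pyclustering; the representative points are assumed to come from the clusterer run on the sample.

```python
import math
import random


def draw_sample(points, sample_size, seed=0):
    """Split point indices into a uniform random sample and the rest.

    Hypothetical helper, not a pyclustering function.
    """
    rng = random.Random(seed)
    indices = list(range(len(points)))
    sample_size = min(sample_size, len(points))
    chosen = set(rng.sample(indices, sample_size))
    sample_idx = sorted(chosen)
    rest_idx = [i for i in indices if i not in chosen]
    return sample_idx, rest_idx


def label_by_nearest_representative(point, representatives):
    """Return the index of the cluster owning the closest representative.

    `representatives` is a list with one list of points per cluster
    (assumed to be produced by clustering the sample).
    """
    best_cluster, best_dist = None, math.inf
    for cluster_id, reps in enumerate(representatives):
        for rep in reps:
            d = math.dist(point, rep)  # Euclidean distance, Python 3.8+
            if d < best_dist:
                best_cluster, best_dist = cluster_id, d
    return best_cluster
```

With a 26000 x 100 dataset, one would cluster only `[points[i] for i in sample_idx]` and then call `label_by_nearest_representative` for each index in `rest_idx`. The sample size and the outlier handling are left open here, since the thread does not settle them.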