
Sampling Size

Open devshah96 opened this issue 5 years ago • 5 comments

Hi @annoviko, does the CURE algorithm implementation sample the dataset irrespective of the dataset size? If yes, what is the sampling size? I tried various parameters, including different numbers of clusters, but I always get one main cluster and a single point in each of the other clusters. I guess this is possible only if all the data points are sampled. My dataset dimensions are 26000 x 100.

devshah96 avatar Jun 18 '19 21:06 devshah96

Hello @devshah96, clustering results depend on the input parameters. I don't know what your data looks like, but for complex data it is not a trivial task to find proper parameters for the algorithm.

annoviko avatar Jun 19 '19 09:06 annoviko

@annoviko Thank you for the quick reply. The CURE algorithm actually subsamples the dataset instead of using the entire dataset. My question is whether the current implementation also subsamples the dataset and then performs clustering, or whether it uses the entire dataset.

devshah96 avatar Jun 19 '19 13:06 devshah96

@devshah96, each point is treated as a separate cluster at the beginning, and the clusters are then merged step by step. The algorithm does not provide the random sampling and partitioning feature, if that is what you are talking about. But as I understand it, that feature helps avoid loading all the data into RAM and does not affect the clustering process itself. Correct me if I am wrong.
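
Since the implementation does not sample on its own, the sampling step can be done by hand around it. Below is a minimal sketch of that idea, not the library's method: `draw_sample` and `label_by_nearest` are hypothetical helper names, it uses only the Python standard library, and the clustering of the sample itself (for example with pyclustering's `cure` class) is left out. It assumes the clustering step yields representative points tagged with a cluster id.

```python
import math
import random


def draw_sample(points, sample_size, seed=None):
    """Return indices of a uniform random sample drawn without replacement.

    Hypothetical helper: if sample_size exceeds the dataset size,
    every index is returned.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(points))
    return rng.sample(range(len(points)), k)


def label_by_nearest(points, representatives):
    """Assign each point the cluster id of its nearest representative.

    `representatives` is a list of (point, cluster_id) pairs, assumed to
    come from clustering the sample only.
    """
    labels = []
    for p in points:
        _, cluster_id = min(representatives, key=lambda rc: math.dist(p, rc[0]))
        labels.append(cluster_id)
    return labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
    sample_idx = draw_sample(data, sample_size=2, seed=0)
    sample = [data[i] for i in sample_idx]
    # ... cluster `sample` here, then collect representative points ...
    reps = [((0.0, 0.0), 0), ((5.0, 5.0), 1)]  # illustrative values only
    print(label_by_nearest(data, reps))
```

Only the sample has to be clustered hierarchically; the remaining points are labelled in a single pass over the data, which is what keeps the cost manageable for something like a 26000 x 100 dataset.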

annoviko avatar Jun 19 '19 13:06 annoviko

@annoviko I was talking about the random sampling and partitioning feature. Don't you think this step is integral to the removal of outliers? Also, does the hierarchical clustering algorithm remove clusters of very low density when merging clusters, as described in the paper?
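
As a rough illustration of the outlier handling in question, here is a sketch of a post-processing filter that sets aside very small clusters, such as the single-point clusters mentioned at the top of this thread. It is a simplification applied after clustering, not the paper's exact procedure, which removes slow-growing clusters during merging. `drop_small_clusters` and `min_size` are hypothetical names, and clusters are assumed to be lists of point indices.

```python
def drop_small_clusters(clusters, min_size):
    """Split clusters into those kept and the points treated as outliers.

    `clusters` is assumed to be a list of lists of point indices.
    Clusters with fewer than `min_size` members are treated as outliers.
    """
    kept = [c for c in clusters if len(c) >= min_size]
    outliers = [i for c in clusters if len(c) < min_size for i in c]
    return kept, outliers


if __name__ == "__main__":
    clusters = [[0, 1, 2, 3], [4], [5, 6]]
    kept, outliers = drop_small_clusters(clusters, min_size=2)
    print(kept, outliers)
```

The points set aside this way could either be discarded or reassigned to the nearest remaining cluster, depending on how the results are used.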

devshah96 avatar Jun 19 '19 14:06 devshah96

@devshah96, good point. It looks like this feature is worth supporting.

annoviko avatar Jun 19 '19 14:06 annoviko