Documentation for using CURE with large datasets
I did not find any documentation on how to use CURE with large (2M+) datasets. Simply using the algorithm as defined in cure.py is not feasible, since building the queue and the KD-tree itself takes significant time.
I noticed that there is random-sampling functionality, added in response to a feature request for exactly this reason. However, I am not clear on how to use it.
Hi @Thalaivar ,
Yes, it is a known issue: random sampling exists, but it is not used by the CURE algorithm. There is the same complaint in #522.
There are plans to support big data, but I am not ready to promise it for the next release; I do not have enough "hands" to do it yet. Contributions are welcome.
In other words, right now it always loads all the data into RAM and builds the KD-tree over it.
By the way, cure.cpp/cure.hpp is used by default (the C++ implementation is used to reach maximum performance), but the issue is still present there as well.
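Until sampling is wired into the library, a common workaround (and the approach from the original CURE paper) is to do the sampling yourself: cluster a random sample that fits in memory, then make one pass over the full dataset assigning each point to the cluster whose representative point is nearest. A minimal sketch is below; the `representors` variable stands in for what `cure.get_representors()` would return after running CURE on the sample (the data and the hardcoded representative points are purely illustrative):

```python
import math
import random

def assign_by_representatives(points, representors):
    """Assign each point to the cluster with the closest representative point."""
    labels = []
    for p in points:
        best_cluster, best_dist = None, math.inf
        for cluster_id, reps in enumerate(representors):
            for r in reps:
                d = math.dist(p, r)  # Euclidean distance (Python 3.8+)
                if d < best_dist:
                    best_cluster, best_dist = cluster_id, d
        labels.append(best_cluster)
    return labels

# Full dataset — in the real case this is the 2M+ points that are too
# large to feed to CURE directly. Two synthetic blobs for illustration.
data = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(1000)] \
     + [(random.uniform(9, 10), random.uniform(9, 10)) for _ in range(1000)]

# Step 1: draw a random sample small enough for the in-memory KD-tree.
sample = random.sample(data, 100)

# Step 2: run CURE on `sample` here and take its representative points
# per cluster (cure.get_representors()). Hardcoded stand-ins below so
# the sketch is self-contained:
representors = [[(0.5, 0.5)], [(9.5, 9.5)]]

# Step 3: one linear sweep over the full dataset to label every point.
labels = assign_by_representatives(data, representors)
```

This keeps the expensive queue/KD-tree construction limited to the sample, and the final labelling pass is O(n × total representatives), which scales to millions of points.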