
Package shuts down on large data.

Open quantkeyvis opened this issue 5 years ago • 5 comments

Hi Dev Team, thanks for the package. It's one of my favorites, since your implementation is straightforward and even includes some improvements from the paper.

My issue is that I recently used the package to cluster a large corpus of text (the tf-idf of the corpus). The RAM on my compute instance is 1.5 TB, so there should be plenty of room in memory.

But what tends to happen is that if I pass an array/dataframe with more than 50k observations, the program shuts down (usually at the "self.kernel_size = _core.get_kernel_size(self.distances, self.fraction)" line). The data sent in is all numeric and the distances are calculated using sklearn.metrics.pairwise.cosine_similarity.

When it shuts down, it gives no error message except: "The python program has shut down"
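Roughly what the run looks like, with synthetic data standing in for the tf-idf matrix (the sizes, the 0.02 fraction, and the exact Cluster(points, fraction=...) call are illustrative, not my literal script):

import numpy as np
from pydpc import Cluster

# Stand-in for the dense tf-idf feature matrix: ~60k documents x 5k terms (~2.4 GB).
points = np.random.rand(60_000, 5_000)

# With more than ~50k rows this dies inside the kernel-size guess, i.e. at
#   self.kernel_size = _core.get_kernel_size(self.distances, self.fraction)
# even though the internal 60k x 60k distance matrix is only ~29 GB.
clu = Cluster(points, fraction=0.02)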

quantkeyvis avatar Mar 26 '20 16:03 quantkeyvis

Do you get the same issue if you try with, for example, only 10k of your data points? Or with the last 10k or 1k or whatever of your data?

The code for guessing a kernel size allocates a scratch array that scales with the data, which can't be helping.

I would recommend trying to set kernel_size explicitly when you construct your Cluster. If you can't think of a good value, you can always run the guess algorithm on a small subset of the data:

from pydpc import Cluster, core as _core   # compiled extension; pydpc imports it as _core
                                           # (adjust the import if your build names it differently)

# dist_matrix / fraction: your precomputed distance matrix and neighbor fraction
guess_kern_size = _core.get_kernel_size(dist_matrix[:500, :500], fraction)
# ...
Cluster(
    kernel_size = guess_kern_size,   # reuse the small-subset guess instead of the full-size one
    fraction = fraction,
    # etc. (your data and remaining arguments as usual)
)
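If you want some reassurance that the guess is not an artifact of which 500 rows you happened to slice, you could check how it behaves as the subset grows (reusing dist_matrix, fraction and _core from the snippet above; the subset sizes here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
for m in (500, 1000, 2000, 5000):
    idx = rng.choice(dist_matrix.shape[0], size=m, replace=False)
    sub = np.ascontiguousarray(dist_matrix[np.ix_(idx, idx)])   # m x m submatrix, C-contiguous
    print(m, _core.get_kernel_size(sub, fraction))
# If the printed values level off, the guess from a small subset is probably good enough.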

Linux-cpp-lisp avatar Mar 26 '20 18:03 Linux-cpp-lisp

So technically the computer should be able to handle a distance matrix of 600k by 600k. But the program only works with a matrix of almost 50k by 50k.

It works with broken-up portions of the data, but the cluster centers and cluster assignments are all relative. So I'm wary of breaking the data up into multiple sets to create multiple clusterings, as it may impact the density and distance metrics and make it difficult or impossible to find those smaller, rarer clusters.

Especially considering that even clustering 900k observations would only account for about 1% of our data.

I'm still thinking through and researching the statistical theory behind such sampling in this use case, particularly since the number of features (words) is in the thousands.
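For what it's worth, the raw storage those sizes imply, back-of-the-envelope only:

# Dense distance-matrix storage at the sizes discussed above.
for n in (50_000, 600_000):
    gb64 = n * n * 8 / 1e9   # float64
    gb32 = n * n * 4 / 1e9   # float32
    print(f"{n:>7} x {n:<7}  float64: {gb64:>6.0f} GB   float32: {gb32:>6.0f} GB")
# 50k x 50k is only ~20 GB, so the ~50k limit does not look like a plain out-of-memory problem;
# 600k x 600k is ~2880 GB in float64 and ~1440 GB in float32, right around the 1.5 TB mark.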

quantkeyvis avatar Mar 26 '20 18:03 quantkeyvis

I completely understand your reluctance to break it up; I was more wondering from a debugging point of view.

Linux-cpp-lisp avatar Mar 26 '20 18:03 Linux-cpp-lisp

I was not directly involved in this project, but here are my two cents on this topic. My suspicion is that within the C extension we use the wrong (or let's say too small) data types for the number and dimension of the data points. So if that number exceeds the data type's range, horrible things happen to the memory access and the program is terminated by the operating system. I think we should replace the data type for N (number of data points) and dim(ension) with size_t instead of plain (signed) int. Unfortunately I do not have the time to do this, but it should be a pretty straightforward change.
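To put numbers on that suspicion (the exact index expressions inside the C extension are an assumption here, but the arithmetic is not): a flat row-major index i*N + j into the full N x N distance matrix exceeds INT_MAX once N reaches about 46,341, and a condensed buffer of N*(N-1)/2 pairwise distances exceeds it once N reaches about 65,537, both squarely in the range where the crashes start. A tiny demonstration of the wraparound:

import numpy as np

INT_MAX = np.iinfo(np.int32).max                 # 2_147_483_647

print(int(np.ceil(np.sqrt(INT_MAX))))            # 46341: smallest N with N*N > INT_MAX
print(65_537 * 65_536 // 2 > INT_MAX)            # True: condensed pair count overflows too

# What the C code would see with plain signed 32-bit ints:
n = np.int32(50_000)
with np.errstate(over="ignore"):
    print(n * n)                                 # negative value, i.e. a wrapped-around index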

marscher avatar Jun 04 '20 10:06 marscher

I have a proposed fix for your problem. I'd be glad if you could give it a try. You can install the fix branch with pip like this:

pip install git+https://github.com/marscher/pydpc@use_long_dtype_for_indexing_and_stdlib_quicksort
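One way to give the branch a quick try without going through the full corpus (assuming the compiled extension can be imported as pydpc.core, which pydpc refers to as _core internally, and using 0.02 purely as an example fraction). The matrix below is ~39 GB, so this only makes sense on a large-memory box:

import numpy as np
from pydpc import core as _core   # adjust the import if your build names the extension differently

# Synthetic symmetric "distance" matrix just past the 32-bit indexing limit.
n = 70_000
d = np.random.rand(n, n)
d = np.maximum(d, d.T)        # symmetrise
np.fill_diagonal(d, 0.0)      # zero self-distances

# On the old code this is where the crash happened; on the fix branch it should just print a number.
print(_core.get_kernel_size(d, 0.02))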

marscher avatar Aug 13 '20 16:08 marscher