pyIPCA calling CCIPCA on big hdf5 dataset loads the whole dataset to memory

Open: thypad opened this issue 11 years ago • 1 comment

I have a large dataset in HDF5 (~6 GB), which means I cannot load all of it into memory at once (one of the reasons to use IPCA). When I used your code for CCIPCA, I ran into trouble with these two lines of the fit() function:

    X = array2d(X)
    X = as_float_array(X, copy=self.copy)

The first one tries to load the whole dataset from disk into memory, and the second one raises a "Memory Error", which indicates that the whole matrix would be loaded into memory. Since both lines are just checks, I commented them out and had no further problems using the code. I think this should be a common use case, hence the issue. I'm not sure whether the problem also happens with pure numpy arrays, but since h5py datasets have a similar interface (slicing, etc.), the results should be the same.
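As an illustration of why slicing avoids the problem, here is a minimal sketch that reads a disk-backed dataset a few rows at a time instead of converting it in one go. The helper name `iter_row_batches` and its `batch_size` parameter are hypothetical and not part of pyIPCA; the sketch only assumes that `X` exposes `.shape` and supports row slicing, as h5py datasets and numpy arrays do. Whether CCIPCA's fit() can consume such batches directly is not confirmed in this thread.

```python
def iter_row_batches(X, batch_size=1000):
    """Yield consecutive row slices of X, at most batch_size rows each.

    Only the requested rows are read on each step, so a disk-backed
    dataset is never materialised in memory as a whole.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    n_rows = X.shape[0]
    for start in range(0, n_rows, batch_size):
        # Slicing past the end is safe: the last batch is simply shorter.
        yield X[start:start + batch_size]


# Hypothetical usage; the file name and dataset key are placeholders:
# import h5py
# with h5py.File("data.h5", "r") as f:
#     for batch in iter_row_batches(f["X"], batch_size=500):
#         ...  # each batch is an in-memory array of at most 500 rows
```

With an h5py dataset, each slice comes back as an ordinary numpy array, so any float conversion can be applied per batch rather than to the full matrix.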

Configuration: python 2.7.6, h5py 2.3.0, numpy 1.8.1, scikit-learn 0.14.1, scipy 0.14.0

Jul 06 '14 09:07 thypad

Thanks for opening an issue! It's cool to see someone else using this!

You're right, these lines are a bit defensive and aren't necessary for the code to work. I can't recall why I added them, but the conversion could just as well have been done before passing my data to CCIPCA. It would probably be better to have a pure check that lets the user know their data is malformed, without converting or copying it. Or we can just remove those lines. If you'd like to open a PR, I'd be happy to merge it in!
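One way such a "pure check" could look is sketched below. It inspects only the metadata of `X` and never reads or copies the data. The function name `check_2d_numeric` is hypothetical and is not part of pyIPCA or scikit-learn; the sketch assumes `X` exposes `.shape` and a numpy-style `.dtype` with a `.kind` attribute, which holds for both numpy arrays and h5py datasets.

```python
def check_2d_numeric(X):
    """Validate X from its metadata only; no element of X is read or copied."""
    shape = getattr(X, "shape", None)
    dtype = getattr(X, "dtype", None)
    if shape is None or dtype is None:
        raise TypeError("X must expose .shape and .dtype "
                        "(e.g. a numpy array or an h5py dataset)")
    if len(shape) != 2:
        raise ValueError("X must be 2-dimensional, got shape %r" % (shape,))
    # numpy dtype kinds: 'f' float, 'i' signed integer, 'u' unsigned integer.
    if dtype.kind not in ("f", "i", "u"):
        raise ValueError("X must be numeric, got dtype %s" % (dtype,))
    return X
```

Because the check returns `X` unchanged, it could stand in for the two conversion lines in fit() while leaving the decision about when to load data to the caller.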

Jul 06 '14 15:07 kevinhughes27
