pyIPCA
calling CCIPCA on big hdf5 dataset loads the whole dataset to memory
I have a big dataset in HDF5 (~6 GB), so I cannot load all of it into memory at once (which is one of the reasons to use IPCA). When I used your code for CCIPCA, I ran into trouble with these two lines of fit():

    X = array2d(X)
    X = as_float_array(X, copy=self.copy)

The first one tries to load the whole dataset from disk into memory, and the second one raises a MemoryError, indicating that the whole matrix would be loaded into memory. Since both are just checks, I commented them out and had no more problems using the code. I think this should be a common use case, hence the issue. I'm not sure whether the problem also happens with pure numpy arrays, but since h5py datasets have a similar interface (slicing, etc.), the results should be the same.
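For reference, here is a minimal sketch of the access pattern that keeps memory bounded: reading a sliceable dataset one block of rows at a time and converting only that block to float. The names `iter_row_chunks` and `chunk_size` are hypothetical and not part of pyIPCA, and a small NumPy array stands in for the h5py dataset, on the assumption that slicing an h5py dataset reads only the requested rows.

```python
import numpy as np


def iter_row_chunks(X, chunk_size=1000):
    """Yield consecutive row blocks of X as float64 NumPy arrays.

    X only needs a .shape attribute and row slicing, which both
    NumPy arrays and h5py datasets provide.
    """
    n_rows = X.shape[0]
    for start in range(0, n_rows, chunk_size):
        # Only this slice is materialized in memory, not the whole dataset.
        yield np.asarray(X[start:start + chunk_size], dtype=np.float64)


# Stand-in for an on-disk dataset (assumption: behaves like an h5py dataset).
X = np.arange(20).reshape(10, 2)
chunks = list(iter_row_chunks(X, chunk_size=4))
```

Each yielded block could then be handed to whatever incremental update the estimator exposes; how CCIPCA consumes rows internally is not shown here.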
Configuration: python (2.7.6) h5py (2.3.0) numpy (1.8.1) scikit-learn (0.14.1) scipy (0.14.0)
Thanks for opening an issue! It's cool to see someone else using this!
You're right, these lines are a bit defensive and aren't necessary for the code to work. I can't recall why I added them, but the conversion could simply have been done before passing my data to CCIPCA. It would probably be better to have a pure check that lets the user know their data is malformed. Or we can just remove those lines. If you'd like to open a PR, I'd be happy to merge it in!
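One possible shape for such a pure check, as a rough sketch only: it inspects `shape` and `dtype` metadata and never copies or converts the data, so it should be safe for on-disk datasets. The function name `check_2d_numeric` and the choice of accepted dtypes are assumptions for illustration, not existing pyIPCA or scikit-learn API.

```python
import numpy as np


def check_2d_numeric(X):
    """Validate X from its metadata only; return X unchanged.

    Hypothetical helper: raises instead of converting, so no data
    is read or copied.
    """
    shape = getattr(X, "shape", None)
    dtype = getattr(X, "dtype", None)
    if shape is None or dtype is None:
        raise TypeError("X must expose .shape and .dtype (e.g. an array or h5py dataset)")
    if len(shape) != 2:
        raise ValueError("X must be 2-dimensional, got shape %r" % (tuple(shape),))
    # Accept float, signed integer, and unsigned integer data.
    if np.dtype(dtype).kind not in "fiu":
        raise ValueError("X must hold numeric data, got dtype %s" % dtype)
    return X


X = np.zeros((5, 3))
checked = check_2d_numeric(X)
```

Returning the input unchanged keeps the check free of side effects; any float conversion can then happen per chunk rather than on the full matrix.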