`partial_fit` and `sieve` can easily outgrow available memory
Thank you for putting together such a great library. It's been extremely helpful.
I was toying with the parameters in the example from the documentation on massive datasets. I noticed that when I use partial_fit (and therefore the sieve optimizer) with slightly more features, or with a larger target sample size, it is easy to hit a memory error. Here is an example that I tried:
# apricot-massive-dataset-example.py
from apricot import FeatureBasedSelection
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
train_data = fetch_20newsgroups(subset='train', categories=('sci.med', 'sci.space'))
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data.data) # This returns a sparse matrix which is supported in apricot
print(X_train.shape)
selector = FeatureBasedSelection(1000, concave_func='sqrt', verbose=False)
selector.partial_fit(X_train)
Running the above, I get:
$ python apricot-massive-dataset-example.py
(1187, 25638)
Traceback (most recent call last):
File "apricot-example.py", line 12, in <module>
selector.partial_fit(X_train)
File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 258, in partial_fit
self.optimizer.select(X, self.n_samples, sample_cost=sample_cost)
File "/envs/bla/lib/python3.8/site-packages/apricot/optimizers.py", line 1103, in select
self.function._calculate_sieve_gains(X, thresholds, idxs)
File "/envs/bla/lib/python3.8/site-packages/apricot/functions/featureBased.py", line 360, in _calculate_sieve_gains
super(FeatureBasedSelection, self)._calculate_sieve_gains(X,
File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 418, in _calculate_sieve_gains
self.sieve_subsets_ = numpy.zeros((l, self.n_samples, self._X.shape[1]), dtype='float32')
numpy.core._exceptions.MemoryError: Unable to allocate 117. GiB for an array with shape (1227, 1000, 25638) and data type float32
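As a sanity check, the size of that allocation matches the shape and dtype reported in the traceback:

# Back-of-envelope check of the allocation reported above
n_thresholds, n_samples, n_features = 1227, 1000, 25638
bytes_needed = n_thresholds * n_samples * n_features * 4   # float32 = 4 bytes per entry
print(bytes_needed / 2**30)                                 # ~117 GiB, as in the MemoryError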
This behavior doesn't happen when I use fit() and another optimizer, e.g., two-stage.
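For reference, a sketch of the fit() variant I mean (assuming the optimizer can be selected by passing the string 'two-stage' to the constructor):

# Same X_train as in the snippet above
selector = FeatureBasedSelection(1000, concave_func='sqrt', optimizer='two-stage', verbose=False)
selector.fit(X_train)   # completes without the huge dense allocation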
Looking into the code, the root cause seems to be the initialization of the sieve_subsets_ array, and the same allocation can happen again later here. In both places we ask for a dense array of zeros of size |thresholds| x |n_samples| x |feature_dims| (float32 in the traceback above), which can become quite large and not fit in memory when dealing with massive datasets. I wonder if there is a more memory-efficient way of writing it? Thanks!
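For illustration only (this is not apricot's code; all names below are hypothetical), one direction could be to keep the per-threshold subsets as lists of sparse rows and only stack them when a subset matrix is actually needed:

# Illustration only -- not apricot code; all names are made up.
from scipy.sparse import vstack

def make_subset_store(n_thresholds):
    # one list of selected sparse rows per threshold
    return [[] for _ in range(n_thresholds)]

def add_row(store, threshold_idx, x_row):
    # x_row: a 1 x n_features sparse slice of the input, e.g. X[i]
    store[threshold_idx].append(x_row)

def materialize(store, threshold_idx):
    # build the subset matrix for one threshold only when it is needed
    rows = store[threshold_idx]
    return vstack(rows) if rows else None

Memory would then scale with the number of stored non-zeros rather than with |thresholds| x |n_samples| x |feature_dims|.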