
Memory consumption or space complexity

Open nilslacroix opened this issue 2 years ago • 10 comments

Is your documentation request related to a problem? Please describe. In your documentation you show time complexity but not space complexity for your different methods. I was quite confused when CV+ (keywords cv=10 and method="plus"), for example, needed to construct a matrix of size n_training_samples * n_test_samples. This gets big quite fast: even with around 60 features and 200k samples, I needed around 50 GB.

Describe the solution you'd like Adding space complexity to the table.

nilslacroix avatar May 01 '22 21:05 nilslacroix

Hey @nilslacroix, I don't think that MAPIE builds matrices of size n_training_samples * n_test_samples, only of size n_training_samples * n_estimators, with n_estimators=10 when cv=10. Am I right @vtaquet? As for the number of features, it is not a scaling parameter of MAPIE; it only affects the internal model you provide.

gmartinonQM avatar May 02 '22 07:05 gmartinonQM

I thought so too, but when I use the parameters above I get an error because the matrix is too big, and the reported size is exactly what I describe. So despite using the CV+ method (with the parameters described in the docs), a n_training_samples * n_test_samples matrix is constructed.

Maybe there is a size argument not working properly?

nilslacroix avatar May 02 '22 09:05 nilslacroix

With X_test = 149689 samples and X_train = 152363 samples, I got this error during prediction. As you can see, the method is "plus" and cv=5 for a default LGBM regressor.

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Input In [72], in <cell line: 3>()
      1 mapie = MapieRegressor(grid_search.best_estimator_, method="plus", cv=5, n_jobs=multiprocessing.cpu_count()-1)
      2 mapie.fit(X_train, y_train)
----> 3 y_pred, y_interval = mapie.predict(X_test, alpha = 0.20)
      4 y_low, y_up =  y_interval[:, 0, :],  y_interval[:, 1, :]
      6 score_coverage = regression_coverage_score(np.expm1(y_test), np.expm1(y_low), np.expm1(y_up))

File ~\miniconda3\envs\Master_ML\lib\site-packages\mapie-0.3.2-py3.9.egg\mapie\regression.py:652, in MapieRegressor.predict(self, X, ensemble, alpha)
    639 y_pred_multi = np.column_stack([e.predict(X) for e in self.estimators_])
    641 # At this point, y_pred_multi is of shape
    642 # (n_samples_test, n_estimators_).
    643 # If ``method`` is "plus":
   (...)
    649 #       ``aggregate_with_mask`` fits it to the right size
    650 #       thanks to the shape of k_.
--> 652 y_pred_multi = self.aggregate_with_mask(y_pred_multi, self.k_)
    654 if self.method == "plus":
    655     if self.residual_score_.sym:

File ~\miniconda3\envs\Master_ML\lib\site-packages\mapie-0.3.2-py3.9.egg\mapie\regression.py:439, in MapieRegressor.aggregate_with_mask(self, x, k)
    437 if self.agg_function in ["mean", None]:
    438     K = np.nan_to_num(k, nan=0.0)
--> 439     return np.matmul(x, (K / (K.sum(axis=1, keepdims=True))).T)
    440 raise ValueError("The value of self.agg_function is not correct")

MemoryError: Unable to allocate 170. GiB for an array with shape (149689, 152363) and data type float64

nilslacroix avatar May 04 '22 17:05 nilslacroix

Hi @nilslacroix, thanks for your feedback! This is indeed a problem that can arise when the test set has a large number of samples. As explained in the theoretical description of the documentation (see Fig of CV+), MAPIE needs to compute the distribution of residuals and predictions over all training samples for each test point. MAPIE does this in a vectorized way with two-dimensional arrays of size (n_train_samples, n_test_samples), hence exceeding memory when the numbers of training and test samples are both high.
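
As a sanity check, the allocation reported in the MemoryError above matches a dense float64 matrix with exactly those shapes:

```python
# Back-of-the-envelope check: the (n_test_samples, n_train_samples) float64
# matrix built by the "plus" method, using the shapes from the traceback above.
n_test, n_train = 149_689, 152_363
gib = n_test * n_train * 8 / 2**30  # 8 bytes per float64
print(f"{gib:.0f} GiB")  # → 170 GiB, matching the MemoryError
```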

We will fix the problem in a later PR by dividing the test set into batches. In the meantime, we invite you to split your test set explicitly and call mapie.predict() in a loop.
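
Until batching is built in, the workaround can be sketched like this. predict_in_batches is a hypothetical helper, not part of MAPIE; it only assumes the model exposes MAPIE's predict(X, alpha=...) signature:

```python
import numpy as np

def predict_in_batches(model, X_test, alpha, batch_size=1000):
    """Call model.predict on slices of X_test so that the internal
    (n_train_samples, n_batch) matrices stay small."""
    n_batches = max(1, int(np.ceil(len(X_test) / batch_size)))
    preds, intervals = [], []
    for X_batch in np.array_split(X_test, n_batches):
        y_pred_batch, y_pis_batch = model.predict(X_batch, alpha=alpha)
        preds.append(y_pred_batch)
        intervals.append(y_pis_batch)
    # Concatenate the batch results back into full-size arrays.
    return np.concatenate(preds), np.concatenate(intervals)
```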

vtaquet avatar May 05 '22 15:05 vtaquet

This is a significant barrier to using this package, in my opinion/experience, and it seems avoidable. Could you not compute the quantiles from self.conformity_scores_, then add those to the original predictions directly? Something like:

bounds = np.nanquantile(self.conformity_scores_, 1 - alpha_np)  # shape: (len(alpha),)
lower_bounds = np.add(y_pred[:, np.newaxis], -bounds)  # shape: (n_test_samples, len(alpha))
upper_bounds = np.add(y_pred[:, np.newaxis], bounds)  # shape: (n_test_samples, len(alpha))

This may only work with "naive" and "base"; I admit I don't fully understand how "plus" and "minmax" operate.
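
For the split ("base") case, the idea can be demonstrated on toy data; the arrays below are synthetic stand-ins for MAPIE's conformity_scores_ and the point predictions:

```python
import numpy as np

# Toy illustration of the constant-width construction: with the split
# ("base") method the interval half-width is a plain quantile of the
# conformity scores, so no (n_train_samples, n_test_samples) matrix is needed.
rng = np.random.default_rng(0)
conformity_scores = np.abs(rng.normal(size=1000))  # stand-in for |y - y_hat| on the calibration set
y_pred = rng.normal(size=5)                        # stand-in for test predictions
alpha = np.array([0.1, 0.05])

bounds = np.nanquantile(conformity_scores, 1 - alpha)  # shape: (len(alpha),)
lower_bounds = y_pred[:, np.newaxis] - bounds          # shape: (n_test_samples, len(alpha))
upper_bounds = y_pred[:, np.newaxis] + bounds          # shape: (n_test_samples, len(alpha))
```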

robert-robison avatar Oct 06 '22 17:10 robert-robison

I still have this issue in mapie==0.6.1

nanophyto avatar Mar 21 '23 10:03 nanophyto

I have the same problem in mapie==0.6.5. It's trying to allocate over 50 GB. I like MAPIE a lot, but I can't use this library if it takes such a naive approach. Please advise if this will be fixed any time soon.

scottee avatar Sep 01 '23 21:09 scottee

Hello @scottee, I recommend that you consult the other issues reporting this problem. Without more context, it's difficult for us to understand your particular case. Here are the answers I've been able to provide so far: https://github.com/scikit-learn-contrib/MAPIE/issues/328 https://github.com/scikit-learn-contrib/MAPIE/issues/326

TL;DR: A priori, this is not a problem with prefit mode. It is a problem that can arise when the calibration set and the test set both have a large number of samples. This behavior is unintended, as the predict method is generally called with a smaller number of test samples during inference.

TL;DR: This is a problem that can arise when the calibration set and the test set have a large number of samples. This behavior is unintended, as the predict method, called in the fit method of MapieTimeSeriesRegressor, is generally used with a smaller number of test samples during inference.

Recommendation: prefer a smaller calibration set. MAPIE will be just as effective but will run faster (200k calibration samples is unreasonably large). The cv feature should be used when you don't have many samples; otherwise, use the prefit or split features.
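
A minimal sketch of that recommendation, assuming you already have a fitted model and large arrays X and y (the names and sizes below are illustrative only):

```python
import numpy as np

# Subsample a modest calibration set instead of handing 200k samples to MAPIE.
rng = np.random.default_rng(42)
n_total, n_calib = 200_000, 5_000
calib_idx = rng.choice(n_total, size=n_calib, replace=False)
# X_calib, y_calib = X[calib_idx], y[calib_idx]
# mapie = MapieRegressor(fitted_model, cv="prefit").fit(X_calib, y_calib)
```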

thibaultcordier avatar Sep 04 '23 07:09 thibaultcordier

Actually, I would recommend implementing the looping process described by @vtaquet directly within the predict methods of all MAPIE classes, essentially by inheritance from a base class. This base class would typically have a predict method with an additional argument called batch_size (as in e.g. tensorflow).

For example:

alpha = [0.01, 0.05]
model = MapieRegressor(base_model, method="plus", cv=5)
model.fit(X_train, y_train)
y_preds, y_pis = model.predict(X_test, alpha=alpha, batch_size=100)

Which could be roughly equivalent to

alpha = [0.01, 0.05]
model = MapieRegressor(base_model, method="plus", cv=5)
model.fit(X_train, y_train)
# code that initializes y_preds and y_pis
n_batches = max(1, len(X_test) // batch_size)
for X_test_batch in np.array_split(X_test, n_batches):
    y_preds_batch, y_pis_batch = model.predict(X_test_batch, alpha=alpha)
    # code that populates y_preds and y_pis with the results of the current batch

gmartinonQM avatar Jan 08 '24 16:01 gmartinonQM