Pytorch cuda out of memory error in cd_mmd_cifar10.ipynb
Hi,
I am using an NVIDIA RTX 3080 graphics card with 10240 MB of GDDR6X memory. When I run the cd_mmd_cifar10.ipynb example, the TensorFlow code works fine, but an out-of-memory error occurs in the PyTorch code. I tried reducing batch_size all the way down to 1 everywhere in the code and still got the same error. As you can see in the code blocks below, the sigma calculation is the line that fails. How can I fix this error? Should I use a graphics card with more memory?
Code blocks:
from alibi_detect.cd.pytorch import preprocess_drift

# define encoder
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2048, encoding_dim)
).to(device).eval()

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=128)

# initialise drift detector
cd = MMDDrift(X_ref_pt, backend='pytorch', p_val=.05, preprocess_fn=preprocess_fn, n_permutations=100)
Error messages:
RuntimeError Traceback (most recent call last)
C:\Users\xxx\AppData\Local\Temp/ipykernel_17800/867170961.py in
~\Miniconda3\envs\alibidet\lib\site-packages\alibi_detect\cd\mmd.py in __init__(self, x_ref, backend, p_val, preprocess_x_ref, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
     91             self._detector = MMDDriftTF(*args, **kwargs)  # type: ignore
     92         else:
---> 93             self._detector = MMDDriftTorch(*args, **kwargs)  # type: ignore
     94         self.meta = self._detector.meta
     95

~\Miniconda3\envs\alibidet\lib\site-packages\alibi_detect\cd\pytorch\mmd.py in __init__(self, x_ref, p_val, preprocess_x_ref, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
     89         if self.infer_sigma or isinstance(sigma, torch.Tensor):
     90             x = torch.from_numpy(self.x_ref).to(self.device)
---> 91             self.k_xx = self.kernel(x, x, infer_sigma=self.infer_sigma)
     92             self.infer_sigma = False
     93         else:

~\Miniconda3\envs\alibidet\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~\Miniconda3\envs\alibidet\lib\site-packages\alibi_detect\utils\pytorch\kernels.py in forward(self, x, y, infer_sigma)
     51             #sigma = (.5 * dist.flatten().sort().values[n_median].unsqueeze(dim=-1)) ** .5
     52             with torch.no_grad():
---> 53                 sigma = (.5 * dist.flatten().sort().values[n_median].unsqueeze(dim=-1)) ** .5
     54                 self.log_sigma.copy_(sigma.log().clone())
     55             self.init_required = False
RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 10.00 GiB total capacity; 197.38 MiB already allocated; 0 bytes free; 214.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Hi @KevinRyu, that is strange. I would have thought that 10GB of memory would be enough for this example, and as you point out, it is with backend='tensorflow'. Do you by any chance have something like htop installed? It would be quite helpful if you could have that open whilst you try to instantiate the MMDDrift detector a few times. What would be helpful to know is whether the memory usage is accumulating over successive pytorch runs, or all in one go, i.e. is it an issue with the pytorch implementation not "letting go" of memory.
I will also try to reproduce and do some memory profiling on my end in the near future.
Just to add, I would be keen to see the memory consumption whilst instantiating the detector using a tool like nvidia-smi. It's possible it's something to do with memory configuration for pytorch.
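For reference, one way to check whether the memory is accumulating over successive runs is to print PyTorch's allocator statistics between detector instantiations. A minimal sketch (assuming x_ref and preprocess_fn are defined as in the notebook; this only shows PyTorch's own allocations, so anything TensorFlow is holding will not appear here):

import torch
from alibi_detect.cd import MMDDrift

def report(tag):
    # PyTorch-side view of GPU memory on the current device
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**20:.0f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.0f} MiB")

report("before")
for i in range(3):
    cd = MMDDrift(x_ref, backend='pytorch', p_val=.05,
                  preprocess_fn=preprocess_fn, n_permutations=100)
    report(f"run {i}, detector alive")
    del cd
    torch.cuda.empty_cache()  # hand cached blocks back to the driver
    report(f"run {i}, after cleanup")

Pairing this with nvidia-smi in another terminal gives the total usage across both frameworks.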
Do you have TensorFlow code or a notebook running at the same time? It might be that TensorFlow reserves all the GPU RAM so not much is left for Torch to use.
Hi, TensorFlow is also used at the start of this example to load CIFAR-10 from tf.keras.datasets. I also suspected that TensorFlow was holding on to the memory, but that code is standard, so it was just a suspicion; it could also be a conflict between the TensorFlow and PyTorch versions. Anyway, batch_size=1 did not help, so your comment seems plausible. I don't know how to solve it.
from functools import partial
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from alibi_detect.cd import MMDDrift
from alibi_detect.models.tensorflow.resnet import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model
from alibi_detect.utils.saving import save_detector, load_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)
Hi,
I did some tests to figure out the cause, but the error below is still not solved. In cd_clf_cifar10.ipynb an out-of-memory error also occurs, although it is a different kind of error from the previous one. However, when I skipped the TensorFlow-related code lines, the PyTorch part ran successfully. Setting batch_size to 1 made no difference. By the way, I have a question: what kind of graphics card are you using for testing, a V100 or an A100? In my case, as I mentioned, it's an RTX 3080 on Windows 10. I think it would be better to separate the TensorFlow parts from the PyTorch parts. What do you think?
cd_clf_cifar10.ipynb error: RuntimeError: CUDA error: out of memory CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
cd_mmd_cifar10.ipynb error:
RuntimeError Traceback (most recent call last)
C:\Users\xxx\AppData\Local\Temp/ipykernel_23136/3450546895.py in
~\Miniconda3\envs\alibidet\lib\site-packages\alibi_detect\cd\mmd.py in __init__(self, x_ref, backend, p_val, preprocess_x_ref, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
     91             self._detector = MMDDriftTF(*args, **kwargs)  # type: ignore
     92         else:
---> 93             self._detector = MMDDriftTorch(*args, **kwargs)  # type: ignore
     94         self.meta = self._detector.meta
     95

~\Miniconda3\envs\alibidet\lib\site-packages\alibi_detect\cd\pytorch\mmd.py in __init__(self, x_ref, p_val, preprocess_x_ref, update_x_ref, preprocess_fn, kernel, sigma, configure_kernel_from_x_ref, n_permutations, device, input_shape, data_type)
     90         if self.infer_sigma or isinstance(sigma, torch.Tensor):
     91             x = torch.from_numpy(self.x_ref).to(self.device)
---> 92             self.k_xx = self.kernel(x, x, infer_sigma=self.infer_sigma)
     93             self.infer_sigma = False
     94         else:

~\Miniconda3\envs\alibidet\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used
   1104         full_backward_hooks, non_full_backward_hooks = [], []

~\Miniconda3\envs\alibidet\lib\site-packages\alibi_detect\utils\pytorch\kernels.py in forward(self, x, y, infer_sigma)
     49             n = n if (x[:n] == y[:n]).all() and x.shape == y.shape else 0
     50             n_median = n + (np.prod(dist.shape) - n) // 2 - 1
---> 51             sigma = (.5 * dist.flatten().sort().values[n_median].unsqueeze(dim=-1)) ** .5
     52             with torch.no_grad():
     53                 self.log_sigma.copy_(sigma.log().clone())
RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 10.00 GiB total capacity; 197.38 MiB already allocated; 0 bytes free; 214.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
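As an aside, the two debugging knobs mentioned in the error messages above are environment variables and need to be set before the process touches CUDA. A minimal sketch (the max_split_size_mb value of 128 is just an illustrative choice; these flags can reduce fragmentation and make stack traces accurate, but they won't free memory that TensorFlow is holding):

import os

# Must be set before torch initialises CUDA, ideally before importing torch
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # mitigate fragmentation
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous launches -> accurate stack traces

import torch  # noqa: E402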
Thanks for looking into this @KevinRyu! I've just tried to reproduce on our end by simply taking the cd_mmd_cifar10.ipynb notebook and running it on a GPU in Google Colab, with some !nvidia-smi commands sprinkled around. It seems you are definitely correct in your hypothesis: TensorFlow isn't giving up the GPU RAM once we've finished with it. The image below shows the GPU RAM; the first step change is when the tensorflow MMDDrift detector is instantiated, and the second when the pytorch one is. Colab gave me a Tesla T4, and in total the above took up 12.6 GiB RAM out of the 15.1 GiB available on the GPU. So it looks like you were just a little short on RAM @KevinRyu.
Solution
With respect to getting TensorFlow to relinquish the RAM, it looks like this is a common issue without a reliable solution (https://github.com/tensorflow/tensorflow/issues/36465), short of killing the Python process. That being said, I can't think of many real-world scenarios where there's a need to run with TensorFlow and then PyTorch under the same Python process.
Since we are moving towards a more modular alibi-detect, where the user only has to install TensorFlow or PyTorch, we will rework the notebooks so that they only use one of the backends at a time. @KevinRyu if you still want to run any of the more memory-hungry notebooks in the meantime, I suggest deleting the necessary cells so that only one backend is used, as you already suggested. Thanks again for flagging up this issue!
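One possible alternative to deleting cells, if you only need TensorFlow for the dataset-loading parts, is to hide the GPU from TensorFlow altogether so that all of the GPU memory stays available for PyTorch. A minimal sketch (this must run before TensorFlow executes anything on the GPU, and is not something the current notebooks do):

import tensorflow as tf

# Keep TensorFlow on the CPU; tf.keras.datasets etc. still work fine,
# and the GPU is left entirely to the PyTorch backend.
tf.config.set_visible_devices([], 'GPU')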
p.s. @KevinRyu locally we currently use RTX 5000's and 2080 Ti's. Notwithstanding the issue you've discovered, your GPU should be enough for most of the Alibi Detect examples. If you do need a little bit more memory/compute you could try using Google Colab too. This comes with TensorFlow 2.7 pre-installed, so you should just need to drop in !pip install alibi-detect before the imports and you'll be good to go.
Hi,
I found a solution after searching about TensorFlow's memory allocation and getting hints from Hao Song's reply and yours. Thank you very much!
Solution: I added the below code to the tensorflow loading code.
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
As you know, this setting makes TensorFlow allocate memory only as needed rather than reserving it all up front. Now only a minimal amount of memory is allocated (in my case, just 1.1 GB). I also skipped the code lines that load the random encoder and the BBSDs model. When I run the PyTorch code, memory usage is 3.6 GB in my environment.
I think this is the best I can do with the current structure of the example code. The PyTorch code works now and there is no out-of-memory problem anymore. :)
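Putting the pieces of this workaround together, a rough sketch of the order of operations (the reference-set size of 500 and encoding_dim of 32 are just illustrative values, and the TensorFlow-only model cells are skipped entirely):

from functools import partial
import numpy as np
import tensorflow as tf
import torch
import torch.nn as nn
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.pytorch import preprocess_drift

# 1. Before TensorFlow runs anything on the GPU, switch it to on-demand allocation
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# 2. TensorFlow is only used to fetch CIFAR-10
(X_train, _), _ = tf.keras.datasets.cifar10.load_data()
X_ref = np.moveaxis(X_train[:500], -1, 1).astype('float32') / 255  # NHWC -> NCHW

# 3. Skip the random encoder / BBSDs cells and go straight to the PyTorch detector
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
encoding_dim = 32
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0), nn.ReLU(),
    nn.Flatten(), nn.Linear(2048, encoding_dim)
).to(device).eval()
preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=128)
cd = MMDDrift(X_ref, backend='pytorch', p_val=.05,
              preprocess_fn=preprocess_fn, n_permutations=100)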
p.s.
I won't open a new issue for the other error; instead I'm adding the solution I found here. In the cd_model_unc_cifar10_wine.ipynb example, .cuda() should be added to the reg model variable. Without it, I get the error below.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
trainer(reg.cuda(), nn.MSELoss(), X_train_dl, device, torch.optim.Adam, learning_rate=0.001, epochs=30)
If you agree this is a problem, please fix it in the next release. That's all.
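A minimal sketch of the general pattern behind this fix (reg, trainer and X_train_dl are the notebook's own objects; .to(device) is used here instead of .cuda() so the same line also works on a CPU-only machine):

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
reg = reg.to(device)  # move the model to the same device the training loop and data use
trainer(reg, nn.MSELoss(), X_train_dl, device, torch.optim.Adam, learning_rate=0.001, epochs=30)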
Hi @KevinRyu, just opening this again to ensure we don't miss the above.
Hi,
Did you finish checking my replies? Is there anything I should do about this? I just wanted to save the extra work of closing the issue on behalf of the issuer, and I'd like to know your opinion.
Hi @KevinRyu, sorry I missed the bit regarding cd_model_unc_cifar10_wine.ipynb, I've opened another issue regarding that just so we don't miss it.
No problem, thanks!
When the full notebook is executed, an OOM (out of memory) error also occurs in cd_text_imdb.ipynb with the RTX 3080. So we need to consider running the code blocks for each model separately. Please keep this in mind when you refactor the structure of the example code.
Thanks @KevinRyu , good to know!
Two import statements fill up my entire 50 GB GPU (independently of each other):
from alibi_detect.cd import SpotTheDiffDrift
and
from alibi_detect.utils.pytorch.data import TorchDataset
It seems odd to me that a simple library import would use up so much RAM.
Hi @rclosson , that should definitely not happen. Can you share your alibi-detect, tensorflow and torch versions?
alibi-detect: 0.9.0
tensorflow: 2.8.0
torch: 1.9.0+cu111
Likely relevant info: I'm running in a Jupyter notebook
I'm running on an NVIDIA Quadro RTX 8000, which is allocated to me through the determined.ai interface of sharing GPUs among AI researchers. (Sorry if I'm using the wrong terms or adding irrelevant info here.)
Thanks @rclosson! Do you have the same issue when running those imports (from alibi_detect.cd import SpotTheDiffDrift and from alibi_detect.utils.pytorch.data import TorchDataset) outside of a notebook (e.g. terminal) as well?
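One way to test this outside a notebook is a small standalone script that asks the driver for total GPU memory in use before and after the import. A rough sketch (it shells out to nvidia-smi, so it captures allocations from any framework, not just PyTorch):

import subprocess

def gpu_mem_used_mib():
    # Total memory in use on GPU 0, in MiB, as reported by the driver
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.splitlines()[0])

print("before import:", gpu_mem_used_mib(), "MiB")
from alibi_detect.cd import SpotTheDiffDrift  # noqa: E402,F401
print("after import:", gpu_mem_used_mib(), "MiB")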
Hi @rclosson , we did a v0.9.1 patch release to fix this issue. Check the changelog for more details on the issue.
Sorry, I'm not sure where I can test outside of a notebook given my specific environment (and the fact that I use shared/pooled resources with my colleagues, so it may be unsafe for me to work outside of the containerized Jupyter environment). If I think of a way to safely test that for you I will.
However, I can confirm that your patch has fixed the problem for from alibi_detect.utils.pytorch.data import TorchDataset. THANK YOU. Will test SpotTheDiffDrift presently.
SpotTheDiffDrift is also no longer taking up all my GPU memory on the first import statement. Thanks again.