CUDA out of memory error while allocating memory

Open jagadeesh09 opened this issue 6 years ago • 7 comments

Hi

I am working on a Tesla K40 machine with 12 GB of GPU memory, and I keep hitting this error. If I estimate the memory the VGG model needs for the batch size set in dataset.py, it comes out far below the available GPU memory. What could be the reason, and how can I get around it? I also hit the error right after initializing the model, when calling cuda().

THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCCachingHostAllocator.cpp line=258 error=2 : out of memory
Traceback (most recent call last):
  File "finetune.py", line 272, in <module>
    fine_tuner.train(epoches = 20)
  File "finetune.py", line 163, in train
    self.train_epoch(optimizer)
  File "finetune.py", line 182, in train_epoch
    for batch, label in self.train_data_loader:
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 81, in _worker_manager_loop
    batch = pin_memory_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 148, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 142, in pin_memory_batch
    return batch.pin_memory()
  File "/usr/local/lib/python2.7/dist-packages/torch/tensor.py", line 92, in pin_memory
    return type(self)().set_(storage.pin_memory()).view_as(self)
  File "/usr/local/lib/python2.7/dist-packages/torch/storage.py", line 87, in pin_memory
    return type(self)(self.size(), allocator=allocator).copy_(self)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/THCCachingHostAllocator.cpp:258

jagadeesh09 avatar Mar 15 '18 07:03 jagadeesh09

@jagadeesh09 Model parameters are not the only thing occupying GPU memory: activations, gradients, and optimizer state take space too. Reducing batch_size to 16 or smaller should help.
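
A rough back-of-the-envelope illustrates this; the numbers below are approximate assumptions for fp32 VGG-16 at 224x224 input, not a real profile:

params_bytes   = 138_000_000 * 4           # ~0.55 GB of weights
grads_momentum = 2 * params_bytes          # gradients + SGD momentum buffers
acts_per_image = 30_000_000 * 4            # ~0.12 GB of saved activations (rough)
batch_size = 32
total = params_bytes + grads_momentum + batch_size * acts_per_image
print(total / 1e9)                         # ~5.5 GB, before allocator/cuDNN workspace overhead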

guangzhili avatar Mar 22 '18 07:03 guangzhili

@guangzhili

When I change the two batch_size values in dataset.py from 32 to 16, I get the following error. Why?

[phung@archlinux pytorch-pruning]$ python finetune.py --train
/usr/lib/python3.7/site-packages/torchvision/transforms/transforms.py:187: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  warnings.warn("The use of the transforms.Scale transform is deprecated, " +
/usr/lib/python3.7/site-packages/torchvision/transforms/transforms.py:562: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
  warnings.warn("The use of the transforms.RandomSizedCrop transform is deprecated, " +
Epoch:  0
Accuracy:  0.3398
Epoch:  1
Accuracy:  0.8265
Epoch:  2
Accuracy:  0.6071
Epoch:  3
Accuracy:  0.63
Epoch:  4
Accuracy:  0.5951
Epoch:  5
Accuracy:  0.5837
Epoch:  6
Accuracy:  0.5537
Epoch:  7
Accuracy:  0.5672
Epoch:  8
Accuracy:  0.506
Epoch:  9
Accuracy:  0.5962
Epoch:  10
Accuracy:  0.6039
Epoch:  11
Accuracy:  0.5436
Epoch:  12
Accuracy:  0.6215
Epoch:  13
Accuracy:  0.5622
Epoch:  14
Accuracy:  0.5872
Epoch:  15
Accuracy:  0.5969
Epoch:  16
Accuracy:  0.5741
Epoch:  17
Accuracy:  0.5725
Epoch:  18
Accuracy:  0.6213
Epoch:  19
Accuracy:  0.6483
Finished fine tuning.
[phung@archlinux pytorch-pruning]$ python finetune.py --prune
/usr/lib/python3.7/site-packages/torchvision/transforms/transforms.py:187: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  warnings.warn("The use of the transforms.Scale transform is deprecated, " +
/usr/lib/python3.7/site-packages/torchvision/transforms/transforms.py:562: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
  warnings.warn("The use of the transforms.RandomSizedCrop transform is deprecated, " +
Accuracy:  0.6483
Number of prunning iterations to reduce 67% filters 5
Ranking filters.. 
Traceback (most recent call last):
  File "finetune.py", line 270, in <module>
    fine_tuner.prune()
  File "finetune.py", line 217, in prune
    prune_targets = self.get_candidates_to_prune(num_filters_to_prune_per_iteration)
  File "finetune.py", line 186, in get_candidates_to_prune
    self.prunner.normalize_ranks_per_layer()
  File "finetune.py", line 101, in normalize_ranks_per_layer
    v = v / np.sqrt(torch.sum(v * v))
  File "/usr/lib/python3.7/site-packages/torch/tensor.py", line 432, in __array__
    return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
[phung@archlinux pytorch-pruning]$ 

buttercutter avatar Oct 10 '18 00:10 buttercutter

I have already installed the latest PyTorch, since it was supposed to have solved this tensor.cpu() problem.

I am using https://www.archlinux.org/packages/community/x86_64/python-pytorch-cuda/

So what is actually still triggering this tensor.cpu() issue?

buttercutter avatar Oct 15 '18 16:10 buttercutter

v = v / np.sqrt(torch.sum(v * v))

Replace np.sqrt(torch.sum(v * v)) with v.norm().

It worked for me. I think np.sqrt() implicitly converts its argument via Tensor.numpy(), which only works for CPU tensors, not CUDA ones.
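
A minimal standalone sketch of that fix, with a random tensor standing in for the rank vector:

import torch

v = torch.randn(64, device="cuda" if torch.cuda.is_available() else "cpu")

# Fails on CUDA tensors: np.sqrt() implicitly calls Tensor.numpy() via __array__.
#   v = v / np.sqrt(torch.sum(v * v))

# Equivalent L2 normalization computed entirely in PyTorch, on any device:
v = v / v.norm()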

nguyenbh1507 avatar Oct 11 '19 08:10 nguyenbh1507

I observed that the out-of-memory error still occurs even after I changed the batch size to 16: the first pruning round was fine, but the second wasn't. I think we should delete the previous, no-longer-used model on the GPU to free its memory before allocating the new one, as sketched below.
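
A sketch of that cleanup; model here is an illustrative stand-in for the previous round's network, not the repo's actual variable:

import gc
import torch
import torchvision

model = torchvision.models.vgg16()          # stand-in for the old, finished model
if torch.cuda.is_available():
    model = model.cuda()

del model                   # drop the last reference to the old network
gc.collect()                # let Python reclaim it
torch.cuda.empty_cache()    # return cached GPU blocks to the CUDA driver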

nguyenbh1507 avatar Oct 11 '19 08:10 nguyenbh1507

I met a similar issue, and solved it by setting pin_memory=False.
https://discuss.pytorch.org/t/using-pined-memory-causes-out-of-memory-error-even-though-batch-size-is-set-to-low-values/30602
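
A minimal sketch of where that flag goes (train_dataset is a placeholder; this repo builds its real loaders in dataset.py). Pinned host memory is exactly what the THCCachingHostAllocator in the traceback allocates, so disabling it sidesteps the failing pin_memory_batch() path:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data; substitute the real dataset from dataset.py.
train_dataset = TensorDataset(torch.randn(64, 3, 224, 224),
                              torch.zeros(64, dtype=torch.long))

loader = DataLoader(train_dataset, batch_size=16, shuffle=True,
                    num_workers=4, pin_memory=False)  # no pinned-host staging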

ChaoLi977 avatar Feb 14 '20 04:02 ChaoLi977

I met a similar issue, and solved it by setting pin_memory=False. https://discuss.pytorch.org/t/using-pined-memory-causes-out-of-memory-error-even-though-batch-size-is-set-to-low-values/30602

Could you clarify where pin_memory is set? How can I change it to False?

akbarali2019 avatar Apr 07 '22 21:04 akbarali2019