
show_batch caused CUDA out of memory error

Open Light-- opened this issue 5 years ago • 2 comments

My code follows course-v3/nbs/dl1/lesson2-download.ipynb but hits an error. I have 4 Titan Xp GPUs on the server; 3 of them are occupied, and only the third GPU ('2') is free for me. No matter whether I select the third GPU or try to force CPU-only, via `os.environ['CUDA_VISIBLE_DEVICES']='2'`, `os.environ['CUDA_VISIBLE_DEVICES']='-1'`, or `defaults.device = torch.device('cpu')`, and even `import torch; torch.cuda.set_device(2)`,

the error still occurs, and the process still uses the first GPU (GPU '0').
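
For reference, a minimal sketch (not from the notebook) of the ordering that device masking needs: `CUDA_VISIBLE_DEVICES` is only read when the process initializes CUDA for the first time, and after masking the remaining GPU is addressed as index 0:

```python
import os
# Must be set before the first CUDA call in the process, otherwise it is ignored.
os.environ['CUDA_VISIBLE_DEVICES'] = '2'   # only physical GPU 2 stays visible

import torch
print(torch.cuda.device_count())    # expected: 1 (the masked view)
torch.cuda.set_device(0)            # index 0 now maps to physical GPU 2
print(torch.cuda.current_device())  # 0
```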

The system is Ubuntu 16.04.
The `| 0 24326 C /usr/bin/python3 321MiB |` line below was my process.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            On   | 00000000:02:00.0 Off |                  N/A |
| 30%   52C    P2   242W / 250W |  12193MiB / 12196MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            On   | 00000000:03:00.0 Off |                  N/A |
| 37%   57C    P2    64W / 250W |  11920MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            On   | 00000000:82:00.0 Off |                  N/A |
| 23%   30C    P8     8W / 250W |     11MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            On   | 00000000:83:00.0 Off |                  N/A |
| 42%   65C    P2    87W / 250W |  12082MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8382      C   python                                     11861MiB |
|    0     24326      C   /usr/bin/python3                             321MiB |
|    1       512      C   python                                     11909MiB |
|    3     19423      C   python                                     12071MiB |
+-----------------------------------------------------------------------------+


Code: each of my two classes ('0' and '1') has 500 images, stored in two folders named after the class under '/home/user/folder1'.

from fastai.vision import *
import os
os.environ['CUDA_VISIBLE_DEVICES']='2'

classes = ['0', '1']
path = Path('/home/user/folder1')
defaults.device = torch.device('cpu')
data.show_batch(rows=3, figsize=(7,8))
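
Note that the snippet above never defines `data` before calling `data.show_batch`; in lesson2-download it is built roughly like this (a sketch from memory of the notebook, the exact arguments may differ):

```python
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224,
                                  num_workers=4).normalize(imagenet_stats)
```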

The error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-24c6f5f0db1f> in <module>
      1 defaults.device = torch.device('cpu')
----> 2 data.show_batch(rows=3, figsize=(7,8))

~/.local/lib/python3.6/site-packages/fastai/basic_data.py in show_batch(self, rows, ds_type, reverse, **kwargs)
    183     def show_batch(self, rows:int=5, ds_type:DatasetType=DatasetType.Train, reverse:bool=False, **kwargs)->None:
    184         "Show a batch of data in `ds_type` on a few `rows`."
--> 185         x,y = self.one_batch(ds_type, True, True)
    186         if reverse: x,y = x.flip(0),y.flip(0)
    187         n_items = rows **2 if self.train_ds.x._square_show else rows

~/.local/lib/python3.6/site-packages/fastai/basic_data.py in one_batch(self, ds_type, detach, denorm, cpu)
    166         w = dl.num_workers
    167         dl.num_workers = 0
--> 168         try:     x,y = next(iter(dl))
    169         finally: dl.num_workers = w
    170         if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)

~/.local/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
     73     def __iter__(self):
     74         "Process and returns items from `DataLoader`."
---> 75         for b in self.dl: yield self.proc_batch(b)
     76 
     77     @classmethod

~/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    346         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    347         if self._pin_memory:
--> 348             data = _utils.pin_memory.pin_memory(data)
    349         return data
    350 

~/.local/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
     53         return type(data)(*(pin_memory(sample) for sample in data))
     54     elif isinstance(data, container_abcs.Sequence):
---> 55         return [pin_memory(sample) for sample in data]
     56     elif hasattr(data, "pin_memory"):
     57         return data.pin_memory()

~/.local/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py in <listcomp>(.0)
     53         return type(data)(*(pin_memory(sample) for sample in data))
     54     elif isinstance(data, container_abcs.Sequence):
---> 55         return [pin_memory(sample) for sample in data]
     56     elif hasattr(data, "pin_memory"):
     57         return data.pin_memory()

~/.local/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
     45 def pin_memory(data):
     46     if isinstance(data, torch.Tensor):
---> 47         return data.pin_memory()
     48     elif isinstance(data, string_classes):
     49         return data

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
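
One plausible reading of the traceback (an assumption, not stated in the thread): the failure happens in the DataLoader's `pin_memory` step, and page-locking host memory through the CUDA caching host allocator first needs a CUDA context on the current device. The process already held 321 MiB on GPU 0 (see the nvidia-smi output above), so the context sits on GPU 0, which is essentially full, and even a "CPU" run dies there. A minimal sketch of that code path outside fastai:

```python
# pin_memory=True touches CUDA even though every tensor stays on the CPU.
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(16, 3, 64, 64), torch.zeros(16, dtype=torch.long))
dl = DataLoader(ds, batch_size=4, pin_memory=True)  # page-locks batches via the CUDA host allocator
x, y = next(iter(dl))  # raises "out of memory" when the current GPU cannot even host a context
```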

Other information about the system:


```text
=== Software ===
python        : 3.6.9
fastai        : 1.0.59
fastprogress  : 0.1.21
torch         : 1.3.0
nvidia driver : 418.87
torch cuda    : 10.1.243 / is available
torch cudnn   : 7603 / is enabled

=== Hardware ===
nvidia gpus   : 4
torch devices : 4
  - gpu0      : 12196MB | TITAN Xp
  - gpu1      : 12196MB | TITAN Xp
  - gpu2      : 12196MB | TITAN Xp
  - gpu3      : 12196MB | TITAN Xp

=== Environment ===
platform      : Linux-4.4.0-31-generic-x86_64-with-Ubuntu-16.04-xenial
distro        : #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016
conda env     : Unknown
python        : /usr/bin/python3.6
sys.path      :
/usr/bin/python3
/usr/local/lib
/usr/lib/python36.zip
/usr/lib/python3.6
/usr/lib/python3.6/lib-dynload
/home/user/.local/lib/python3.6/site-packages
/usr/local/lib/python3.6/dist-packages
/usr/lib/python3/dist-packages
```

Light-- avatar Nov 07 '19 11:11 Light--

Solved.

This is a weird bug. I got past it by:

1. adding the following code at the beginning of my program,
2. restarting the Jupyter notebook's kernel,
3. re-running the whole notebook,

and repeating steps 2 and 3 several times...

The added code:

from fastai.vision import *
import os
import torch
os.environ['CUDA_VISIBLE_DEVICES']='2' 
torch.cuda.set_device(2)

I don't know why this solved it. Is it because some of the CUDA memory was occupied and had not been completely cleaned up before? Why didn't the added code above work the first time? Does anyone have a clue? Please tell me, thanks.
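
For what it's worth, a quick check (a sketch, not from the original code) of whether the mask actually took effect:

```python
import torch

# If CUDA_VISIBLE_DEVICES='2' was honoured, only one device is visible and
# index 0 maps to physical GPU 2. If this still prints 4, the mask came too
# late (CUDA was already initialized) and it is torch.cuda.set_device(2)
# that actually moves the work onto the right card.
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(torch.cuda.current_device()))
```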

Now `nvidia-smi` prints:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            On   | 00000000:02:00.0 Off |                  N/A |
| 37%   54C    P5    14W / 250W |  11872MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            On   | 00000000:03:00.0 Off |                  N/A |
| 35%   55C    P2    82W / 250W |  11920MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            On   | 00000000:82:00.0 Off |                  N/A |
| 23%   30C    P8     9W / 250W |   3582MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            On   | 00000000:83:00.0 Off |                  N/A |
| 39%   62C    P2    84W / 250W |  12082MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8382      C   python                                     11861MiB |
|    1       512      C   python                                     11909MiB |
|    2     20327      C   /usr/bin/python3                            3571MiB |
|    3     19423      C   python                                     12071MiB |
+-----------------------------------------------------------------------------+

Light-- avatar Nov 07 '19 13:11 Light--

In my case it was because another Jupyter Python kernel was running and eating up all the GPU memory.

Solution: kill the other kernels here:

(screenshot omitted)
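
When the memory is held by the current notebook rather than another kernel, a lighter option (a sketch, not from this thread) is to drop the objects that hold GPU tensors and clear PyTorch's cache:

```python
import gc
import torch

# del learn, data   # hypothetical names for whatever objects hold GPU tensors
gc.collect()               # make sure the dropped Python objects are actually freed
torch.cuda.empty_cache()   # return cached blocks to the driver; re-check nvidia-smi
```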

lucasbordeau avatar Sep 12 '20 12:09 lucasbordeau