show_batch caused a CUDA out of memory error
My code followed course-v3/nbs/dl1/lesson2-download.ipynb, but it raised an error. I have 4 Titan Xp GPUs on the server; three of them are occupied and only the third GPU ('2') is free. But no matter whether I select that GPU or fall back to the CPU with
os.environ['CUDA_VISIBLE_DEVICES'] = '2'
or os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
or defaults.device = torch.device('cpu')
(I even used import torch; torch.cuda.set_device(2)),
the error still occurs and the first GPU (GPU '0') is still used.
The system is Ubuntu 16.04.
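One note on the approaches above: CUDA_VISIBLE_DEVICES is only honoured if it is set before the CUDA context is created, so it has to run before anything in the process (torch, fastai, or an earlier cell) touches the GPU. A minimal sketch of that ordering, assuming a freshly restarted kernel:

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2'   # must be set before CUDA is initialised

import torch
from fastai.vision import *

print(torch.cuda.device_count())     # expected: 1 (only physical GPU 2 is visible)
print(torch.cuda.current_device())   # expected: 0, which now maps to physical GPU 2
defaults.device = torch.device('cuda:0')
```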
In the nvidia-smi output below, the line | 0 24326 C /usr/bin/python3 321MiB | was my process.
```text
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp On | 00000000:02:00.0 Off | N/A |
| 30% 52C P2 242W / 250W | 12193MiB / 12196MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp On | 00000000:03:00.0 Off | N/A |
| 37% 57C P2 64W / 250W | 11920MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp On | 00000000:82:00.0 Off | N/A |
| 23% 30C P8 8W / 250W | 11MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp On | 00000000:83:00.0 Off | N/A |
| 42% 65C P2 87W / 250W | 12082MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 8382 C python 11861MiB |
| 0 24326 C /usr/bin/python3 321MiB |
| 1 512 C python 11909MiB |
| 3 19423 C python 12071MiB |
+-----------------------------------------------------------------------------+
```
Code: each of the two classes ('0' and '1') of my image data has 500 images, stored under '/home/user/folder1' in two folders named after the class.
```python
from fastai.vision import *
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '2'
classes = ['0', '1']
path = Path('/home/user/folder1')
# DataBunch constructed as in lesson2-download.ipynb
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=224,
                                  num_workers=4).normalize(imagenet_stats)
defaults.device = torch.device('cpu')
data.show_batch(rows=3, figsize=(7,8))
```
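One thing worth checking before show_batch: as far as I understand fastai v1 (this is my reading of the behaviour, not verified against its source), the DataBunch stores the device it was created with, so setting defaults.device = torch.device('cpu') after data already exists may come too late. A quick check:

```python
# Compare the device fastai will use for *new* DataBunches with the device
# this DataBunch was actually built with.
print(defaults.device)   # e.g. device(type='cpu') after the assignment above
print(data.device)       # if this still says 'cuda', rebuild `data` after
                         # changing defaults.device
```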
The error traceback:
```text
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-24c6f5f0db1f> in <module>
1 defaults.device = torch.device('cpu')
----> 2 data.show_batch(rows=3, figsize=(7,8))
~/.local/lib/python3.6/site-packages/fastai/basic_data.py in show_batch(self, rows, ds_type, reverse, **kwargs)
183 def show_batch(self, rows:int=5, ds_type:DatasetType=DatasetType.Train, reverse:bool=False, **kwargs)->None:
184 "Show a batch of data in `ds_type` on a few `rows`."
--> 185 x,y = self.one_batch(ds_type, True, True)
186 if reverse: x,y = x.flip(0),y.flip(0)
187 n_items = rows **2 if self.train_ds.x._square_show else rows
~/.local/lib/python3.6/site-packages/fastai/basic_data.py in one_batch(self, ds_type, detach, denorm, cpu)
166 w = dl.num_workers
167 dl.num_workers = 0
--> 168 try: x,y = next(iter(dl))
169 finally: dl.num_workers = w
170 if detach: x,y = to_detach(x,cpu=cpu),to_detach(y,cpu=cpu)
~/.local/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
73 def __iter__(self):
74 "Process and returns items from `DataLoader`."
---> 75 for b in self.dl: yield self.proc_batch(b)
76
77 @classmethod
~/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
346 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
347 if self._pin_memory:
--> 348 data = _utils.pin_memory.pin_memory(data)
349 return data
350
~/.local/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
53 return type(data)(*(pin_memory(sample) for sample in data))
54 elif isinstance(data, container_abcs.Sequence):
---> 55 return [pin_memory(sample) for sample in data]
56 elif hasattr(data, "pin_memory"):
57 return data.pin_memory()
~/.local/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py in <listcomp>(.0)
53 return type(data)(*(pin_memory(sample) for sample in data))
54 elif isinstance(data, container_abcs.Sequence):
---> 55 return [pin_memory(sample) for sample in data]
56 elif hasattr(data, "pin_memory"):
57 return data.pin_memory()
~/.local/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
45 def pin_memory(data):
46 if isinstance(data, torch.Tensor):
---> 47 return data.pin_memory()
48 elif isinstance(data, string_classes):
49 return data
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
```
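The failure comes from the DataLoader's pin_memory step rather than from moving a batch to a GPU: pinning host memory goes through the CUDA caching host allocator, which initialises a CUDA context on the current device (by default device 0), even though the data itself stays on the CPU. A tiny sketch of the same code path in plain PyTorch, just to illustrate where the allocation happens:

```python
import torch

# pin_memory() allocates page-locked host memory via CUDA, so it creates a
# CUDA context on the current device even though the tensor stays on the CPU.
x = torch.empty(1000)
x = x.pin_memory()        # this is the call that fails with "out of memory"
print(x.is_pinned())      # True when pinning succeeded
```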
Other information about the system:
```text
=== Software ===
python : 3.6.9
fastai : 1.0.59
fastprogress : 0.1.21
torch : 1.3.0
nvidia driver : 418.87
torch cuda : 10.1.243 / is available
torch cudnn : 7603 / is enabled
=== Hardware ===
nvidia gpus : 4
torch devices : 4
- gpu0 : 12196MB | TITAN Xp
- gpu1 : 12196MB | TITAN Xp
- gpu2 : 12196MB | TITAN Xp
- gpu3 : 12196MB | TITAN Xp
=== Environment ===
platform : Linux-4.4.0-31-generic-x86_64-with-Ubuntu-16.04-xenial
distro : #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016
conda env : Unknown
python : /usr/bin/python3.6
sys.path :
/usr/bin/python3
/usr/local/lib
/usr/lib/python36.zip
/usr/lib/python3.6
/usr/lib/python3.6/lib-dynload
/home/user/.local/lib/python3.6/site-packages
/usr/local/lib/python3.6/dist-packages
/usr/lib/python3/dist-packages
```
Solved.
This is a weird bug. I got past it by:
1. adding the following code at the beginning of my program,
2. restarting the Jupyter notebook's kernel,
3. re-running the whole notebook,
and repeating steps 2 and 3 several times.
The added code:
```python
from fastai.vision import *
import os
import torch

os.environ['CUDA_VISIBLE_DEVICES'] = '2'
torch.cuda.set_device(2)
```
I don't know why this fixed it. Is it because some of the CUDA memory was still occupied and had not been completely released before? And why didn't this same code work earlier? Does anyone have a clue? Please tell me, thanks.
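On the question of leftover memory: one way to see whether memory is being held is to compare what PyTorch itself has allocated against what nvidia-smi reports. A small sketch using standard torch.cuda calls (this only inspects the current process; memory held by other kernels or processes only shows up in nvidia-smi):

```python
import torch

dev = torch.cuda.current_device()
# Memory actually held by live tensors in this process:
print(torch.cuda.memory_allocated(dev) / 1024**2, 'MiB allocated')
# Memory held by PyTorch's caching allocator (renamed memory_reserved in newer PyTorch):
print(torch.cuda.memory_cached(dev) / 1024**2, 'MiB cached')
# Release cached blocks back to the driver; does not free live tensors:
torch.cuda.empty_cache()
```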
Now nvidia-smi prints:
```text
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp On | 00000000:02:00.0 Off | N/A |
| 37% 54C P5 14W / 250W | 11872MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp On | 00000000:03:00.0 Off | N/A |
| 35% 55C P2 82W / 250W | 11920MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp On | 00000000:82:00.0 Off | N/A |
| 23% 30C P8 9W / 250W | 3582MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp On | 00000000:83:00.0 Off | N/A |
| 39% 62C P2 84W / 250W | 12082MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 8382 C python 11861MiB |
| 1 512 C python 11909MiB |
| 2 20327 C /usr/bin/python3 3571MiB |
| 3 19423 C python 12071MiB |
+-----------------------------------------------------------------------------+
```
In my case it was because another Jupyter Python kernel was running and eating up all the GPU memory.
Solution: kill the other kernels.
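To tell which of the PIDs in the nvidia-smi process table belongs to which notebook, one simple trick is to print the kernel's own PID from inside each open notebook and match it against that table. A minimal sketch (standard library only):

```python
import os

# Run this in each open notebook; the number printed is the PID that appears
# in the nvidia-smi "Processes" table for that kernel.
print(os.getpid())
```

The kernel whose PID matches the large memory entry on the GPU is the one to shut down, for example from the Running tab of the Jupyter dashboard.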