pytorch_hand_classifier
Insufficient shared memory when running the project on Google Colab
Since my own laptop has no GPU acceleration, I run this project on Google Colab. Colab provides a free GPU, but at runtime I hit an insufficient shared memory error. Has anyone else run into this problem?
2018-07-28 12:48:24 [INFO]: Start training epoch 1
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "classifier_train.py", line 65, in <module>
    trainer.train()
  File "/content/drive/pytorch_hand_classifier/utils/Trainer.py", line 104, in train
    self._train_one_epoch()
  File "/content/drive/pytorch_hand_classifier/utils/Trainer.py", line 133, in _train_one_epoch
    for step, (data, label) in enumerate(self.train_data):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 275, in __next__
    idx, batch = self._get_batch()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 254, in _get_batch
    return self.data_queue.get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 175, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2068) is killed by signal: Bus error.
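The error message points at /dev/shm, the tmpfs that multi-worker DataLoaders use to pass batches back to the main process. A quick way to confirm how much shared memory the Colab instance actually has is to check the size of that mount; this is a minimal diagnostic sketch using only the Python standard library, not code from this project:

import shutil

# /dev/shm is where DataLoader worker processes place batch tensors for
# inter-process transfer; a small tmpfs here triggers the bus error above.
usage = shutil.disk_usage("/dev/shm")
print("/dev/shm total: %.0f MiB, free: %.0f MiB"
      % (usage.total / 2**20, usage.free / 2**20))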
It's free, after all /dog-head. Serious answer: reduce batch_size. If it was 64, for example, change it to 32 or smaller; prediction accuracy may take a hit, so it is a tradeoff (see the sketch below). Also, remember to clear the GPU memory first: sometimes the cache is not released immediately after the current program terminates, and you have to manually kill the related processes to free it. Interestingly, I have found that Google automatically assigns you whatever GPU is idle: your GPU may have 8 GB of memory when training starts and be swapped for a 16 GB card midway through. It is luck of the draw.
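To make the batch-size fix concrete, here is a minimal sketch. The random stand-in dataset and the num_workers value are assumptions for illustration, not this project's actual training code, and the note about num_workers=0 is a general PyTorch workaround rather than part of the reply above:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset of random image-sized tensors; the real project loads
# hand-gesture images through its own Dataset class.
train_dataset = TensorDataset(torch.randn(256, 3, 64, 64),
                              torch.randint(0, 10, (256,)))

# Halve the batch size (64 -> 32): workers ship whole batches through
# /dev/shm, so smaller batches shrink the shared-memory footprint at the
# cost of some accuracy. Setting num_workers=0 would load data in the
# main process and avoid /dev/shm entirely, at the cost of loading speed.
train_loader = DataLoader(train_dataset, batch_size=32,
                          shuffle=True, num_workers=2)

for step, (data, label) in enumerate(train_loader):
    pass  # the training step from Trainer._train_one_epoch goes here

# After a crashed run, cached GPU memory can linger until the owning
# process dies; within a live process this releases PyTorch's cache.
if torch.cuda.is_available():
    torch.cuda.empty_cache()

A smaller batch_size also lowers per-step GPU memory use, which helps on the memory-swapping cards the reply mentions.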