CUDA out of memory when continuing training from a checkpoint
I get CUDA out of memory every time I continue training from a checkpoint, but there is no error if I load the initial weights and train from the first epoch. I am using my own dataset, but I suspect the problem is more likely something in the distributed training setup. Any suggestion on how I should check the code?
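One quick way to check where the memory goes (a sketch of my own, not code from the repo; the helper name and call sites are assumptions): print the per-GPU allocation in each spawned worker right before and right after the checkpoint is loaded. If only GPU 0 jumps by roughly the size of the model plus optimizer state, the checkpoint tensors are being deserialized onto GPU 0 in every process.

```python
import torch

def log_gpu_memory(tag, gpu):
    # Hypothetical helper -- call it in each worker around the
    # torch.load(...) that restores checkpoint.pth.tar.
    allocated = torch.cuda.memory_allocated(gpu) / 1024 ** 2
    reserved = torch.cuda.memory_reserved(gpu) / 1024 ** 2
    print(f'[rank {gpu}] {tag}: allocated {allocated:.0f} MiB, '
          f'reserved {reserved:.0f} MiB')

# e.g. inside main_worker:
#   log_gpu_memory('before resume', gpu)
#   checkpoint = torch.load(checkpoint_file, ...)
#   log_gpu_memory('after resume', gpu)
```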
Training from epoch 0 works fine:
Epoch: [0][0/167] Time: 9.341s (9.341s) Speed: 2.1 samples/s Data: 7.671s (7.671s) Stage0-heatmaps: 2.215e-03 (2.215e-03) Stage1-heatmaps: 6.406e-04 (6.406e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 4.953e-08 (4.953e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][0/167] Time: 9.341s (9.341s) Speed: 2.1 samples/s Data: 7.873s (7.873s) Stage0-heatmaps: 1.990e-03 (1.990e-03) Stage1-heatmaps: 5.832e-04 (5.832e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 4.789e-08 (4.789e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][100/167] Time: 0.539s (0.651s) Speed: 37.1 samples/s Data: 0.000s (0.101s) Stage0-heatmaps: 4.487e-04 (1.019e-03) Stage1-heatmaps: 4.257e-04 (5.118e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 3.724e-07 (4.452e-07) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][100/167] Time: 0.541s (0.651s) Speed: 36.9 samples/s Data: 0.000s (0.099s) Stage0-heatmaps: 4.705e-04 (1.050e-03) Stage1-heatmaps: 4.493e-04 (5.196e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 3.321e-07 (4.364e-07) Stage1-pull: 0.000e+00 (0.000e+00)
=> saving checkpoint to output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3
Continuing training produces the error:
Target Transforms (if any): None
=> loading checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar'
=> loading checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar'
=> loaded checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar' (epoch 5)
=> loaded checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar' (epoch 5)
Epoch: [5][0/167] Time: 9.577s (9.577s) Speed: 2.1 samples/s Data: 8.164s (8.164s) Stage0-heatmaps: 1.595e-04 (1.595e-04) Stage1-heatmaps: 7.866e-05 (7.866e-05) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 6.155e-08 (6.155e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [5][0/167] Time: 9.665s (9.665s) Speed: 2.1 samples/s Data: 7.976s (7.976s) Stage0-heatmaps: 1.904e-04 (1.904e-04) Stage1-heatmaps: 8.872e-05 (8.872e-05) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 5.090e-08 (5.090e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Traceback (most recent call last):
  File "tools/dist_train.py", line 323, in <module>
    main()
  File "tools/dist_train.py", line 115, in main
    args=(ngpus_per_node, args, final_output_dir, tb_log_dir)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/dist_train.py", line 285, in main_worker
    final_output_dir, tb_log_dir, writer_dict, fp16=cfg.FP16.ENABLED)
  File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/../lib/core/trainer.py", line 76, in do_train
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 52.00 MiB (GPU 0; 7.80 GiB total capacity; 5.73 GiB already allocated; 27.31 MiB free; 5.86 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fa6f3122536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1:
root@a2bff378da93:/kpoints/HigherHRNet-Human-Pose-Estimation# /usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
  len(cache))
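If I had to guess at a cause: when a checkpoint is restored without a map_location, torch.load deserializes the tensors onto the GPU they were saved from, so both workers put an extra copy of the weights and optimizer state on GPU 0, and the ~5.7 GiB already allocated there leaves nothing for the backward pass. A minimal sketch of a CPU-side resume, assuming checkpoint keys 'epoch', 'state_dict', and 'optimizer' (the function and keys are my own, not the repo's trainer code):

```python
import torch

def resume_from_checkpoint(model, optimizer, checkpoint_file):
    # Hypothetical helper -- a sketch of the idea, not the repo's trainer code.
    # Deserialize onto the CPU so no worker lands its copy of the weights
    # and optimizer state on GPU 0.
    checkpoint = torch.load(checkpoint_file, map_location='cpu')

    begin_epoch = checkpoint['epoch']
    model.load_state_dict(checkpoint['state_dict'])
    # Optimizer.load_state_dict casts the state tensors to the devices of
    # the parameters they belong to, so nothing stays behind on the CPU.
    optimizer.load_state_dict(checkpoint['optimizer'])

    # Drop the CPU copy and release any cached blocks before training resumes.
    del checkpoint
    torch.cuda.empty_cache()
    return begin_epoch
```

An alternative is to pass map_location='cuda:%d' % gpu so each worker maps the tensors straight onto its own device; either way avoids every process defaulting to GPU 0.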
I had the same problem. :(
I had the same problem
I had the same problem. Did you solve it?