DFGN-pytorch
About gradient_accumulate_step
I set a smaller gradient_accumulate_step = 5 and still hit OOM, as shown below. Is the problem that the GPUs are too small? If I use 4 GPUs, how should I allocate them? I assigned two to each in config, but it still errors out.

GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   30C    P0    28W / 250W |  11675MiB / 12198MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   30C    P0    29W / 250W |   7823MiB / 12198MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     21430      C   python                                     11665MiB |
|    1     21430      C   python                                      7813MiB |
+-----------------------------------------------------------------------------+
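For context on the four-GPU question: the usual way PyTorch spreads one model over several GPUs is torch.nn.DataParallel, which scatters each batch across the listed devices and gathers the outputs back on the first one, so GPU 0 typically shows more memory use than the others, as in the nvidia-smi output above. A minimal generic sketch follows; the device ids and the stand-in model are illustrative assumptions, not DFGN's actual config.py wiring.

import torch
from torch import nn

# Generic DataParallel sketch; the stand-in model and device ids are
# illustrative assumptions, not the actual settings in DFGN's config.py.
device_ids = [0, 1, 2, 3]
model = nn.Linear(768, 2)                                  # stand-in for the real network
model = nn.DataParallel(model, device_ids=device_ids).cuda(device_ids[0])

x = torch.randn(16, 768).cuda(device_ids[0])               # one batch of 16 examples
y = model(x)  # the batch is scattered over the 4 GPUs and the outputs are
              # gathered back on device_ids[0], which is why GPU 0 carries
              # the extra memory load seen in nvidia-smi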
Error message:

Avg-LOSS0/batch/step: 6.6137880611419675
Avg-LOSS1/batch/step: 3.8737828445434572
Avg-LOSS2/batch/step: 0.00037345796823501587
Avg-LOSS3/batch/step: 1.449139289855957
Avg-LOSS4/batch/step: 1.2904924607276917
100%|█████████████████████████████████████████| 962/962 [19:47<00:00, 1.06it/s]
  1%|▎                                         | 2/232 [00:02<04:24, 1.15s/it]
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/nesi/nobackup/uoa02874/anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "train.py", line 59, in run
    join(args.prediction_path, 'pred_epoch{}.json'.format(epc)))
  File "train.py", line 122, in predict
    start, end, sp, Type, softmask, ent, yp1, yp2 = model(batch, return_yp=True)
  File "/home/zden658/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/scale_wlg_persistent/filesets/project/uoa02874/PycharmProjects/DFGN-pytorch-master/DFGN/model/GFN.py", line 59, in forward
    input_state, entity_state, softmask = self.basicblocks[l](input_state, query_vec, batch)
  File "/home/zden658/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/scale_wlg_persistent/filesets/project/uoa02874/PycharmProjects/DFGN-pytorch-master/DFGN/model/layers.py", line 245, in forward
    entity_state = self.tok2ent(doc_state, entity_mapping, entity_length)
  File "/home/zden658/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/scale_wlg_persistent/filesets/project/uoa02874/PycharmProjects/DFGN-pytorch-master/DFGN/model/layers.py", line 46, in forward
    entity_states = entity_mapping.unsqueeze(3) * doc_state.unsqueeze(1)  # N x E x L x d
RuntimeError: CUDA out of memory. Tried to allocate 1.17 GiB (GPU 0; 11.91 GiB total capacity; 5.55 GiB already allocated; 523.38 MiB free; 5.15 GiB cached)
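The failing line builds a dense N x E x L x d tensor (batch size x entities x sequence length x hidden size), so its memory grows with all four of those and with nothing that gradient_accumulate_step controls. A rough back-of-the-envelope sketch, using assumed illustrative shapes rather than values read from DFGN's config:

# Rough size of entity_mapping.unsqueeze(3) * doc_state.unsqueeze(1)  (N x E x L x d).
# All shapes below are assumptions for illustration, not DFGN's real config values.
N, E, L, d = 8, 40, 512, 768        # batch size, entities per example, sequence length, hidden size
bytes_per_float = 4                 # float32
tensor_bytes = N * E * L * d * bytes_per_float
print(f"N x E x L x d tensor: {tensor_bytes / 1024**3:.2f} GiB")   # about 0.47 GiB for these shapes

Halving N roughly halves that allocation, which is why lowering batch_size helps even when gradient accumulation is enabled.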
I went into config.py and set the batch size to a value that lets training continue. It seems that gradient_accumulate_step = 5 does not reduce the batch size nearly enough.
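That matches how gradient accumulation works: gradient_accumulate_step only delays optimizer.step(), while every forward/backward pass still processes a full batch_size worth of examples, so peak activation memory is unchanged. To cut memory you lower batch_size, and raise gradient_accumulate_step if you want to keep the same effective batch. A minimal generic PyTorch sketch, not DFGN's actual training loop, assuming the model returns a scalar loss:

def train_epoch(model, loader, optimizer, gradient_accumulate_step=5):
    # Generic gradient-accumulation sketch, not DFGN's train.py.
    model.train()
    optimizer.zero_grad()
    for i, batch in enumerate(loader):               # each batch on its own sets the peak memory
        loss = model(batch)                          # assumption: the model returns a scalar loss
        (loss / gradient_accumulate_step).backward() # gradients add up across the small batches
        if (i + 1) % gradient_accumulate_step == 0:
            optimizer.step()                         # effective batch = batch_size * accumulate steps
            optimizer.zero_grad()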