CDial-GPT RuntimeError: CUDA out of memory.

python train.py --pretrained --model_checkpoint thu-coai/CDial-GPT_LCCC-large --data_path data/STC.json --scheduler linear

Hello, I clearly have enough memory, so why does it still report this error? I have also changed batch_size to 1.

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.73 GiB total capacity; 904.23 MiB already allocated; 26.38 MiB free; 1020.00 MiB reserved in total by PyTorch)

Epoch: [63/4391266] 0%| , loss=0.0535, lr=5e-5 [00:09<174:20:29

It stops at 63 every time. What does 4391266 mean? Can this number be reduced?

Aug 02 '22 11:08 Deerzh
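A note on the numbers in the error message above: the arithmetic can be checked with a short sketch. It parses the sizes out of the quoted message and computes how much GPU memory is neither free nor reserved by PyTorch's allocator in the training process. The helper name `field_mib` is hypothetical, and the interpretation is only a possibility: the remainder may belong to other processes on the same GPU or to CUDA context overhead, and the thread does not say which.

```python
import re

# Error text copied verbatim from the post above.
message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB "
    "(GPU 0; 10.73 GiB total capacity; 904.23 MiB already allocated; "
    "26.38 MiB free; 1020.00 MiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def field_mib(text, label):
    """Return the size that directly precedes `label` in the message, in MiB.

    Hypothetical helper written for this note, not part of PyTorch.
    """
    match = re.search(r"([\d.]+) (MiB|GiB) " + re.escape(label), text)
    if match is None:
        raise ValueError(f"field not found: {label}")
    return float(match.group(1)) * UNIT_TO_MIB[match.group(2)]


total = field_mib(message, "total capacity")
reserved = field_mib(message, "reserved in total by PyTorch")
free = field_mib(message, "free")

# Memory that is neither free nor held by this process's PyTorch allocator.
unaccounted = total - reserved - free

print(f"total:       {total:10.2f} MiB")
print(f"reserved:    {reserved:10.2f} MiB")
print(f"free:        {free:10.2f} MiB")
print(f"unaccounted: {unaccounted:10.2f} MiB ({unaccounted / 1024:.2f} GiB)")
```

If the unaccounted figure is large, as it is here, running `nvidia-smi` to see what else is occupying the GPU may be worth trying before changing the training configuration.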

I've run into this problem too. Has it been solved?

Sep 02 '22 11:09 ai408

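"I've run into this problem too. Has it been solved?"

Not yet, I haven't solved it either 😂. My guess is that the dataset is too large. Shrinking the dataset would probably fix it, but I don't know how to shrink it.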

Sep 02 '22 11:09 Deerzh

I'm trying this instead: https://github.com/thu-coai/EVA

Sep 02 '22 11:09 ai408

The amount of data I'm using isn't very large.

Sep 02 '22 11:09 ai408

Then I'm not sure. The author may need to look into it.

Sep 02 '22 11:09 Deerzh

EVA seems to need even more GPU memory.

Sep 02 '22 11:09 ai408

Setting num_workers to 1 fixed it.

Sep 02 '22 11:09 ai408

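Hello, the memory on your GPU may be a bit small. With longer sequences, too many activations have to be stored, which can cause an OOM. You could cap the maximum sequence length during training, or switch to a GPU with more memory.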

Sep 03 '22 08:09 silverriver
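The suggestion above to cap the maximum sequence length can be sketched as follows. This is a minimal illustration, not code from train.py: the helper `truncate_to_max_len` and the value of `MAX_LEN` are hypothetical, and keeping the most recent tokens (rather than the oldest) is an assumption that suits dialogue history. A real input pipeline may also need to preserve special tokens at the start of the sequence.

```python
def truncate_to_max_len(token_ids, max_len):
    """Keep at most `max_len` token ids, dropping the oldest ones.

    Hypothetical helper for illustration; not part of the CDial-GPT code.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # A negative slice keeps the tail; shorter sequences are returned whole.
    return token_ids[-max_len:]


MAX_LEN = 8  # assumed value for the demo; tune to what the GPU can hold

# Toy "tokenized dialogues" of different lengths.
batch = [list(range(5)), list(range(20)), list(range(8))]
truncated = [truncate_to_max_len(ids, MAX_LEN) for ids in batch]

for before, after in zip(batch, truncated):
    print(len(before), "->", len(after))
```

Activation memory grows with sequence length, so bounding the longest sequence bounds the worst-case batch rather than the average one.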

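"Setting num_workers to 1 fixed it."

num_workers is a parameter of PyTorch's DataLoader. It controls how many CPU processes are used to load data, and its value does not affect how much GPU memory the model uses.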

Sep 03 '22 08:09 silverriver

Hello, how can I reduce the epoch count? I changed --n_epochs to 1 in train.py, so why is the number still this large when I run it?

Epoch: [1709/2195633] 0%| , loss=0.0528, lr=5e-5 [01:48<38:44:15

Sep 06 '22 13:09 Deerzh

I think that number is changed through the batch size.

Sep 07 '22 04:09 ai408
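The two progress bars in this thread are consistent with the total being the number of batches in one epoch rather than the number of epochs, which would explain why --n_epochs leaves it unchanged. A minimal sketch of that arithmetic follows. It assumes the first total, 4391266, equals the number of training samples because that run used batch_size 1. The function name is hypothetical, and the batch size of the second run is not stated in the thread (a second GPU would halve the count in the same way).

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once.

    Hypothetical helper; assumes the last partial batch is kept.
    """
    return math.ceil(num_samples / batch_size)


# Assumption: the first progress bar (batch_size 1) shows the sample count.
NUM_SAMPLES = 4391266

for batch_size in (1, 2, 8):
    print(batch_size, iterations_per_epoch(NUM_SAMPLES, batch_size))
```

With a batch size of 2 this gives 2195633, the total shown in the second progress bar.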

It runs out of memory on a Tesla V100 as well. Those of us on a tight budget had better not use it.

Dec 22 '22 07:12 chenjh880730
