LM-LSTM-CRF

Asking about a CUDA OOM error

Open gonghouyu opened this issue 6 years ago • 3 comments

I ran the code on Chinese NER training data (around 70 thousand sentences, with LM-LSTM-CRF set to the co-train model), and I got an OOM error:

When I set the batch_size to 10, it results in:

    Tot it 6916 (epoch 0): 6308it [26:09, 4.02it/s]
    THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
    Traceback (most recent call last):
      File "train_wc.py", line 243, in <module>
        loss.backward()
      File "/usr/local/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
      File "/usr/local/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
        variables, grad_variables, retain_graph)
    RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

When I set the batch_size to 128, it results in:

    Tot it 543 (epoch 0): 455it [03:57, 1.91it/s]
    THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
    Traceback (most recent call last):
      File "train_wc.py", line 241, in <module>
        loss = loss + args.lambda0 * crit_lm(cbs, cf_y.view(-1))
      File "/usr/local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.5/site-packages/torch/nn/modules/loss.py", line 601, in forward
        self.ignore_index, self.reduce)
      File "/usr/local/lib/python3.5/site-packages/torch/nn/functional.py", line 1140, in cross_entropy
        return nll_loss(log_softmax(input, 1), target, weight, size_average, ignore_index, reduce)
      File "/usr/local/lib/python3.5/site-packages/torch/nn/functional.py", line 786, in log_softmax
        return torch._C._nn.log_softmax(input, dim)
    RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

Could anyone give me some advice on how to solve it?

gonghouyu avatar Jun 19 '18 05:06 gonghouyu

Hi, what type of GPU are you using, and how large is its memory?

For Chinese, even character-level language modeling results in a large dictionary (and therefore large GPU memory consumption). One way to alleviate this is to filter low-frequency words out as unknown tokens, as in the sketch below.
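A minimal sketch of what that filtering could look like (illustrative only, not the repository's actual preprocessing code; the `min_count` threshold and the `<unk>` token name are assumptions):

```python
from collections import Counter

def build_char_dictionary(sentences, min_count=5):
    # Count character frequencies over the whole corpus.
    counts = Counter(ch for sent in sentences for ch in sent)
    # Keep only characters seen at least `min_count` times; everything else
    # is mapped to a single <unk> id, which shrinks the LM output layer.
    dictionary = {'<unk>': 0}
    for ch, freq in counts.items():
        if freq >= min_count:
            dictionary[ch] = len(dictionary)
    return dictionary

def chars_to_ids(sentence, dictionary):
    # Replace rare characters with the <unk> id.
    unk = dictionary['<unk>']
    return [dictionary.get(ch, unk) for ch in sentence]
```

A higher threshold makes the dictionary, and the language-model softmax built on top of it, smaller, which directly reduces GPU memory.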

LiyuanLucasLiu avatar Jun 19 '18 06:06 LiyuanLucasLiu

The GPU is a Tesla K40c; we have four of them, and each has 10 GB of memory. Both using only one GPU and setting the PyTorch code to multi-GPU give the same OOM error. Setting mini_count to 5 or even 10 also doesn't help. But if I don't use co_train, it works well.

gonghouyu avatar Jun 20 '18 12:06 gonghouyu

Yes, language modeling for Chinese is a little tricky. I think some model modification is necessary to make it work.
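One possible modification (an illustrative sketch under my own assumptions, not a change that exists in this repository) is to replace the full-vocabulary softmax of the character LM head with an adaptive softmax, so the output layer's memory no longer scales with the full character dictionary. Note this needs a newer PyTorch than the 0.3.x shown in the traceback above; the sizes and cutoffs below are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size = 20000   # illustrative size of the character dictionary
hidden_size = 300    # illustrative LM hidden size

# Adaptive softmax: frequent characters live in a small, cheap head cluster,
# rare ones in larger tail clusters with reduced projection size.
adaptive_head = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_size,
    n_classes=vocab_size,
    cutoffs=[2000, 10000],
)

# hidden: flattened LM outputs (batch * seq_len, hidden_size);
# targets: gold next-character ids for the language-model objective.
hidden = torch.randn(32, hidden_size)
targets = torch.randint(0, vocab_size, (32,))

out = adaptive_head(hidden, targets)
lm_loss = out.loss  # would stand in for the cross-entropy LM loss term
```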

LiyuanLucasLiu avatar Jun 20 '18 17:06 LiyuanLucasLiu