
When running: CUDA error: out of memory

Open simba0626 opened this issue 5 years ago • 7 comments

Hi, sorry to trouble you again. I ran: `CUDA_VISIBLE_DEVICES=1 python corechain.py -model slotptr -device cuda -dataset lcquad -pointwise True 1`

The error occurs at the line `loss.backward()`. My GPU has 10 GB of memory.
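A common workaround for an OOM at `loss.backward()` (not specific to this repo) is gradient accumulation: splitting each batch into micro-batches so that fewer activations are held in memory at once while keeping the effective batch size. A minimal sketch, with a placeholder model and data that are not from corechain.py:

```python
import torch
import torch.nn as nn

# Illustrative model and data; in practice this would be the slotptr model
# and the lcquad batches.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(32, 10)
targets = torch.randn(32, 1)

accum_steps = 4            # split one batch of 32 into 4 micro-batches of 8
micro = data.shape[0] // accum_steps

optimizer.zero_grad()
for i in range(accum_steps):
    x = data[i * micro:(i + 1) * micro]
    y = targets[i * micro:(i + 1) * micro]
    # Scale the loss so accumulated gradients match the full-batch gradient.
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()        # gradients accumulate across micro-batches
optimizer.step()
```

Each `backward()` call only needs the activations of one micro-batch, which lowers peak GPU memory.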

Thank you for your help.

simba0626 avatar Oct 11 '19 01:10 simba0626

Sorry to trouble you again. After I set the pretrained embedding's `requires_grad = False`, it runs fine. In detail: `if vectors is not None: self.embedding_layer = nn.Embedding.from_pretrained(torch.FloatTensor(vectors))` followed by `self.embedding_layer.weight.requires_grad = False  # was True`
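For reference, `nn.Embedding.from_pretrained` freezes the weights by default (`freeze=True`), so the explicit `requires_grad = False` line has the same effect. A minimal sketch, with random stand-in vectors instead of the real pretrained embeddings:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained word vectors (vocab 100, dim 300 assumed here).
vectors = torch.randn(100, 300)

# from_pretrained defaults to freeze=True, i.e. requires_grad = False.
embedding_layer = nn.Embedding.from_pretrained(torch.FloatTensor(vectors))
assert embedding_layer.weight.requires_grad is False

# A frozen embedding stores no gradients (and no optimizer state),
# which is where the memory saving comes from.
lookup = embedding_layer(torch.tensor([0, 1]))
```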

This means the embeddings are non-trainable. Does this setup affect reproducing the experimental results?

Thank you.

simba0626 avatar Oct 13 '19 03:10 simba0626

It would affect the results. Can you tell me the batch size and other related hyperparameters? Also, can you run it with `-pointwise False`?

saist1993 avatar Oct 13 '19 11:10 saist1993

Hi, batch size = 4000, epochs = 100; the other related hyperparameters are the same as in the source. Specifically:

[lcquad]
_neg_paths_per_epoch_train = 100
_neg_paths_per_epoch_validation = 1000
total_negative_samples = 1000
batch_size = 4000
hidden_size = 256
number_of_layer = 1
embedding_dim = 300
vocab_size = 15000
dropout = 0.5
dropout_rec = 0.3
dropout_in = 0.3
output_dim = 300
rel_pad = 25
relsp_pad = 12
relrd_pad = 2
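Some rough arithmetic (assuming fp32 parameters and an Adam-style optimizer) on why freezing the embeddings helps, using the `vocab_size` and `embedding_dim` above:

```python
# vocab_size x embedding_dim parameters, 4 bytes each in fp32.
vocab_size, embedding_dim = 15000, 300
bytes_per_float = 4

params = vocab_size * embedding_dim                  # 4.5M parameters
weights_mb = params * bytes_per_float / 2**20        # weight storage in MiB

# If trainable, Adam keeps a gradient plus two moment buffers per parameter,
# roughly 3x extra memory on top of the weights themselves.
extra_mb = 3 * weights_mb
print(round(weights_mb, 1), round(extra_mb, 1))      # → 17.2 51.5
```

The embedding table itself is small (tens of MiB), so on a 10 GB card the OOM is more likely driven by activations from the large `batch_size = 4000` than by the embeddings alone; freezing them just removes the gradient/optimizer overhead.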

I ran the command: `CUDA_VISIBLE_DEVICES=1 python corechain.py -model slotptr -device cuda -dataset lcquad -pointwise False`

simba0626 avatar Oct 14 '19 00:10 simba0626

Sorry to trouble you. The results are: BestValiAcc: 0.654, BestTestAcc: 0.664. In addition, during evaluation I get `RuntimeError: CUDA error: out of memory`.

Could you help me solve it? Thank you.

simba0626 avatar Oct 14 '19 00:10 simba0626

I think this is happening because the file tries to load another slot-pointer instance while one is already in memory. It will not affect the final result much, as the best-performing model (the one with the highest validation accuracy) is stored on disk. I have highlighted the best accuracy result in the image.

You can run onefile.py with the appropriate params to load the model and re-run the evaluation. I would also recommend running it for more epochs, as it looks like the model has not converged.
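As a general pattern (the model and checkpoint path below are placeholders, not the actual onefile.py API), loading the saved best model and evaluating under `torch.no_grad()` avoids storing activations for backprop, which is usually enough to prevent out-of-memory during evaluation:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the slotptr network.
model = nn.Linear(300, 2)
# Hypothetical checkpoint path; the real one is written by corechain.py:
# model.load_state_dict(torch.load("best_model.pt"))

model.eval()                     # disable dropout etc. for evaluation
with torch.no_grad():            # no activation buffers kept for backward
    batch = torch.randn(8, 300)
    scores = model(batch)

assert scores.requires_grad is False
```

Evaluating in smaller batches inside the `no_grad()` block reduces peak memory further if the error persists.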

[image: training log with the best accuracy result highlighted]

saist1993 avatar Oct 14 '19 09:10 saist1993

OK, I will give it a try. But I have a question: how many epochs should I set?

thanks

simba0626 avatar Oct 14 '19 13:10 simba0626

I found 300 epochs in the paper. I will try that.

simba0626 avatar Oct 15 '19 02:10 simba0626