transformer-xl
CUDA out of memory
I have a machine with 6 Titan GPUs, each with 12 GB of memory. I changed the code to add my own dataset, but I always get a CUDA out-of-memory error:
Run training...
Experiment dir : /home/agemagician/Downloads/transformer-xl/pytorch/models/uniref50/base_v1-uniref50/20190419-023635
Loading cached dataset...
Traceback (most recent call last):
File "train.py", line 190, in <module>
device=device, ext_len=args.ext_len)
File "/home/agemagician/Downloads/transformer-xl/pytorch/data_utils.py", line 239, in get_iterator
data_iter = LMOrderedIterator(self.train, *args, **kwargs)
File "/home/agemagician/Downloads/transformer-xl/pytorch/data_utils.py", line 29, in __init__
self.data = data.view(bsz, -1).t().contiguous().to(device)
RuntimeError: CUDA out of memory. Tried to allocate 40.00 GiB (GPU 0; 11.75 GiB total capacity; 0 bytes already allocated; 11.08 GiB free; 0 bytes cached)
It doesn't matter whether I reduce the model size or the target length, or even add batch chunking. Here is my bash file:
#!/bin/bash
if [[ $1 == 'train' ]]; then
echo 'Run training...'
python train.py \
--cuda \
--data /media/agemagician/Disk2/projects/protin/dataset/uniref50_transformer_xl \
--dataset uniref50 \
--n_layer 12 \
--d_model 512 \
--n_head 8 \
--d_head 64 \
--d_inner 2048 \
--dropout 0.1 \
--dropatt 0.0 \
--optim adam \
--lr 0.00025 \
--warmup_step 10000 \
--max_step 400000 \
--tgt_len 200 \
--mem_len 200 \
--eval_tgt_len 128 \
--batch_size 24 \
--multi_gpu \
--varlen \
--gpu0_bsz 4 \
--fp16 \
--dynamic-loss-scale \
--batch_chunk 4 \
${@:2}
elif [[ $1 == 'eval' ]]; then
echo 'Run evaluation...'
python eval.py \
--cuda \
--data /media/agemagician/Disk2/projects/protin/dataset/uniref50_transformer_xl \
--dataset uniref50 \
--tgt_len 80 \
--mem_len 4096 \
--clamp_len 820 \
--same_length \
--split test \
${@:2}
else
echo 'unknown argument 1'
fi
It seems the script tries to load the whole data file into GPU memory at once.
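For scale, here is a rough back-of-the-envelope check, assuming the tokenized corpus is stored as 64-bit integers (PyTorch's default LongTensor dtype):

# The failed allocation in the traceback is 40 GiB for the full corpus tensor.
corpus_bytes = 40 * 2**30      # 40 GiB reported by the error
bytes_per_token = 8            # int64 / torch.LongTensor
n_tokens = corpus_bytes // bytes_per_token
print(n_tokens)                # ~5.4 billion tokens -- far more than a 12 GB card can hold

So no amount of shrinking the model or the batch helps, because the allocation happens before training even starts.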
I solved the problem by changing line 29 in data_utils.py:
# Evenly divide the data across the bsz batches.
#self.data = data.view(bsz, -1).t().contiguous().to(device)
self.data = data.view(bsz, -1).t().contiguous().to('cpu')
Apparently, train.py passes cuda as the device, and that was the issue.
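If you still want each training step to run on the GPU while the corpus stays in host memory, another option is to move only the current slice inside get_batch. Below is a minimal, simplified sketch of the iterator; the method and attribute names mirror the upstream data_utils.py, but treat the details as an assumption rather than a drop-in patch:

import torch

class LMOrderedIterator:
    # Simplified sketch: the corpus stays on the CPU and only each
    # (seq_len, bsz) slice is copied to the GPU when a batch is requested.
    def __init__(self, data, bsz, bptt, device='cpu', ext_len=0):
        self.bsz = bsz
        self.bptt = bptt
        self.ext_len = ext_len
        self.device = device

        # Trim the corpus so it divides evenly into bsz columns,
        # then reshape to (n_step, bsz) -- on the CPU, not the GPU.
        n_step = data.size(0) // bsz
        data = data[:n_step * bsz]
        self.data = data.view(bsz, -1).t().contiguous()

    def get_batch(self, i, bptt=None):
        if bptt is None:
            bptt = self.bptt
        seq_len = min(bptt, self.data.size(0) - 1 - i)

        end_idx = i + seq_len
        beg_idx = max(0, i - self.ext_len)

        # Only this small slice is moved to the GPU.
        data = self.data[beg_idx:end_idx].to(self.device)
        target = self.data[i + 1:i + 1 + seq_len].to(self.device)

        return data, target, seq_len

This keeps peak GPU memory for the data proportional to tgt_len * batch_size instead of the full corpus size.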