tensor2tensor

Out of Memory while training

Open dinosaxon opened this issue 5 years ago • 1 comments

I am getting an out-of-memory (OOM) error when training with 8 GPUs, but not with 1 GPU.

I use the following command to train.

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams='max_length=100,batch_size=1024,eval_drop_long_sequences=true' \
  --worker_gpu=8 \
  --train_steps=350000 \
  --hparams_set=$HPARAMS \
  --eval_steps=5000 \
  --output_dir=$TRAIN_DIR \
  --schedule=continuous_train_and_eval

Any suggestions? I also tried reducing batch_size and max_length, but with no luck.
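For reference, here is a rough sketch of what the hparams above imply. It assumes batch_size is counted in tokens per GPU, which is how I understand tensor2tensor treats it for text problems; if that assumption is wrong, the numbers below do not apply. The variable names are only for illustration and simply mirror the flags in the command.

```shell
#!/bin/sh
# Values copied from the t2t-trainer command above.
batch_size=1024   # assumed to mean tokens per GPU, not sentences
worker_gpu=8
max_length=100

# Tokens processed per training step across all GPUs (under the assumption above).
tokens_per_step=$((batch_size * worker_gpu))

# How many maximum-length sequences fit in one per-GPU batch (integer division).
seqs_per_gpu=$((batch_size / max_length))

echo "tokens per step across all GPUs: $tokens_per_step"
echo "max-length sequences per GPU batch: $seqs_per_gpu"
```

If the assumption holds, each GPU sees the same batch size in the 1-GPU and 8-GPU runs, and only the total per step grows.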

dinosaxon, Sep 08 '20 13:09

Same question. It seems that reducing batch_size does not make a difference.

shizhediao, Oct 20 '22 14:10