Out of memory when fine-tuning
Thank you for this important contribution!
I am trying to fine-tune your full model on a V100 with 16GB memory. Even when setting batch size to 1 in the patch, I seem to be running out of memory (see error below). Is there any way to fine-tune your model on a 16GB machine?
Thanks, Oren.
2019-10-14 20:27:40.672735: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15753943296 memory_limit_: 15753943450 available bytes: 154 curr_region_allocation_bytes_: 31507887104
2019-10-14 20:27:40.672751: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit:                 15753943450
InUse:                 15753943296
MaxInUse:              15753943296
NumAllocs:                    3949
MaxAllocSize:           1262254080

2019-10-14 20:27:40.672835: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
ERROR:tensorflow:Error recorded from training_loop: Dst tensor is not initialized.
	 [[node save/RestoreV2 (defined at training.py:164) ]]
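For rough context on the question above, some back-of-the-envelope arithmetic (mine, not from the thread), assuming the 1.63B-parameter CTRL model in float32:

```python
# Back-of-the-envelope memory math (my own estimate, not from the thread).
# CTRL has roughly 1.63e9 parameters, so float32 weights plus gradients
# already approach 13 GiB before optimizer slots and activations, which is
# why a 16 GiB card is tight even at batch size 1.
params = 1.63e9
bytes_per_float32 = 4
gib = 1024 ** 3
print(f"weights:   {params * bytes_per_float32 / gib:.1f} GiB")
print(f"gradients: {params * bytes_per_float32 / gib:.1f} GiB")
```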
I'm not sure this is an OOM error. The training should succeed on a 16GB V100. Can you provide more details about the file you're fine-tuning, TF versions etc.?
Did the fine-tuning steps for Moby Dick succeed for you or did those fail as well?
I am using Python 3.7.4 (fresh Anaconda distribution) on an EC2 Linux machine, with tensorflow-gpu==1.14 and your Keras patch set to batch size 1.
Running now with Moby Dick. Same situation. Pretty quickly training seems to hang after printing this warning:
2019-10-15 18:09:05.363842: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
The GPU utilization:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    40W / 300W |  15469MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4700      C   python                                     15443MiB |
+-----------------------------------------------------------------------------+
A while later (maybe an hour) I get the error I mentioned in my previous post and the program exits.
Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.
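For anyone who wants to try the transformers angle mentioned above, here is an untested sketch (not the repo's official fine-tuning path) of loading CTRL in PyTorch and computing a language-modeling loss; the prompt with the Books control code is only illustrative:

```python
# Untested sketch of the huggingface/transformers route: load CTRL and
# compute a language-modeling loss, the building block a custom PyTorch
# fine-tuning loop would need. "ctrl" is the model identifier used by the
# transformers library; the prompt below is just an example.
import torch
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLLMHeadModel.from_pretrained("ctrl")

input_ids = torch.tensor([tokenizer.encode("Books Call me Ishmael.")])
outputs = model(input_ids, labels=input_ids)
loss = outputs[0]  # the loss comes first when labels are supplied
loss.backward()    # from here a standard optimizer.step() applies
```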
@keskarnitish How do I run training.py on GPU? When I ran python training.py --model_dir ../seqlen256_v1.ckpt --iterations 250, the model ran on the CPU by default.
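One quick sanity check (my own snippet, not from the repo): if tensorflow-gpu cannot use the GPU, for example because of a CUDA/TF version mismatch, the device list below contains no GPU and TF silently falls back to CPU.

```python
# Check whether tensorflow-gpu 1.14 actually sees a usable GPU; an
# incompatible CUDA install shows up here as an empty GPU device list.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())                         # True only if a GPU is usable
print([d.name for d in device_lib.list_local_devices()])  # e.g. '/device:GPU:0'
```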
Oh, my CUDA 10.1 is not compatible with tensorflow-gpu 1.14.0.
After fixing this issue, I get the following:
2019-10-30 18:26:06.376093: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2019-10-30 18:26:06.376141: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at reduction_ops_common.h:180 : Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node encoder/encoder_layer_12/layer_normalization_24/moments/mean (defined at ../transformer.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[training/clip_by_global_norm/mul_1/_12367]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node encoder/encoder_layer_12/layer_normalization_24/moments/mean (defined at ../transformer.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node encoder/encoder_layer_12/layer_normalization_24/moments/mean:
encoder/encoder_layer_11/add_1 (defined at ../transformer.py:98)
Input Source operations connected to node encoder/encoder_layer_12/layer_normalization_24/moments/mean:
encoder/encoder_layer_11/add_1 (defined at ../transformer.py:98)
My system is Ubuntu 18.04 with a Tesla V100 32GB (about 25GB free) and tensorflow-gpu 1.14.0. I tried batch sizes of 4, 2, and 1.
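For reference, the report_tensor_allocations_upon_oom hint in the trace above maps to a TF 1.x RunOptions flag. A minimal sketch with a plain tf.Session and a toy op (wiring it into the Estimator-based training.py would need a session hook, which I haven't attempted here):

```python
# Minimal, untested sketch of enabling report_tensor_allocations_upon_oom
# so that an OOM failure also dumps the live tensor allocations. The matmul
# below is a stand-in for the real training step.
import tensorflow as tf

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

a = tf.random.normal([1024, 1024])
b = tf.matmul(a, a)

with tf.Session() as sess:
    # Passing the options to sess.run is what activates the extra report.
    sess.run(b, options=run_options)
```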
While I explore this, I noticed a PR that seems to circumvent this issue (https://github.com/salesforce/ctrl/pull/51). I haven't tested this out but it might be a temporary solution.
Yeah, I can confirm I also can't get a 16GB V100 (8 CPUs, 30GB RAM, 100GB SSD) to work with tensorflow-gpu==1.14 on the Moby Dick training example with batch_size = 1 and iterations = 1, using the 256 model (_v0).
Can you recommend another GPU that would be good for training? Happy to try another. To my understanding, NickWalton's fix handles multiple GPUs but doesn't describe which ones?
Fine-tuning does work on the 32 GB GV100.
> Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.
About this (for general info): what tricks are usually applied to make a lower-memory branch like you did? I looked at the diff with master, and it seems you reduced many tensors from float32 to float16. What else would you try?
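One other common trick besides casting to float16 is gradient accumulation: run several micro-batches, sum their gradients, and apply a single optimizer step, trading compute for activation memory. A minimal TF 1.x sketch on a toy model (purely illustrative, not the CTRL training loop):

```python
# Gradient accumulation in plain TF 1.x: accumulate gradients over
# accum_steps micro-batches, then apply one averaged update. The tiny
# linear model is only there to make the snippet self-contained.
import tensorflow as tf

accum_steps = 4
x = tf.placeholder(tf.float32, [None, 8])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.get_variable("w", [8, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

opt = tf.train.GradientDescentOptimizer(0.01)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)

# One non-trainable buffer per variable holds the running gradient sum.
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
zero_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accum])
accum_op = tf.group(*[a.assign_add(g) for a, g in zip(accum, grads)])
apply_op = opt.apply_gradients(
    [(a / accum_steps, v) for a, v in zip(accum, tvars)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(zero_op)
    for _ in range(accum_steps):
        sess.run(accum_op, {x: [[0.0] * 8], y: [[1.0]]})
    sess.run(apply_op)
```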
> Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.
I get an OOM error on a 32GB V100