Out of memory when fine-tuning
Thank you for this important contribution!
I am trying to fine-tune your full model on a V100 with 16GB memory. Even when setting batch size to 1 in the patch, I seem to be running out of memory (see error below). Is there any way to fine-tune your model on a 16GB machine?
Thanks, Oren.
2019-10-14 20:27:40.672735: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15753943296 memory_limit_: 15753943450 available bytes: 154 curr_region_allocation_bytes_: 31507887104
2019-10-14 20:27:40.672751: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit:                 15753943450
InUse:                 15753943296
MaxInUse:              15753943296
NumAllocs:                    3949
MaxAllocSize:           1262254080

2019-10-14 20:27:40.672835: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
ERROR:tensorflow:Error recorded from training_loop: Dst tensor is not initialized.
	 [[node save/RestoreV2 (defined at training.py:164) ]]
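For rough context on the question above, some back-of-the-envelope arithmetic (mine, not from the thread), assuming the 1.63B-parameter CTRL model in float32:

```python
# Back-of-the-envelope memory math (my own estimate, not from the thread).
# CTRL has roughly 1.63e9 parameters, so float32 weights plus gradients
# already approach 13 GiB before optimizer slots and activations, which is
# why a 16 GiB card is tight even at batch size 1.
params = 1.63e9
bytes_per_float32 = 4
gib = 1024 ** 3
print(f"weights:   {params * bytes_per_float32 / gib:.1f} GiB")
print(f"gradients: {params * bytes_per_float32 / gib:.1f} GiB")
```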
I'm not sure this is an OOM error. The training should succeed on a 16GB V100. Can you provide more details about the file you're fine-tuning, TF versions etc.?
Did the fine-tuning steps for Moby Dick succeed for you or did those fail as well?
I am using Python 3.7.4 (fresh Anaconda distribution) on an EC2 Linux machine, with tensorflow-gpu==1.14 and your Keras patch set to batch size 1.
Running now with Moby Dick. Same situation. Pretty quickly training seems to hang after printing this warning:
2019-10-15 18:09:05.363842: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
The GPU utilization:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    40W / 300W |  15469MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4700      C   python                                     15443MiB |
+-----------------------------------------------------------------------------+
A while later (maybe an hour) I get the error I mentioned in my previous post and the program exits.
Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.
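For anyone who wants to try the transformers angle mentioned above, here is an untested sketch (not the repo's official fine-tuning path) of loading CTRL in PyTorch and computing a language-modeling loss; the prompt with the Books control code is only illustrative:

```python
# Untested sketch of the huggingface/transformers route: load CTRL and
# compute a language-modeling loss, the building block a custom PyTorch
# fine-tuning loop would need. "ctrl" is the model identifier used by the
# transformers library; the prompt below is just an example.
import torch
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLLMHeadModel.from_pretrained("ctrl")

input_ids = torch.tensor([tokenizer.encode("Books Call me Ishmael.")])
outputs = model(input_ids, labels=input_ids)
loss = outputs[0]  # the loss comes first when labels are supplied
loss.backward()    # from here a standard optimizer.step() applies
```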
@keskarnitish How do I run training.py on GPU? When I ran python training.py --model_dir ../seqlen256_v1.ckpt --iterations 250, the model ran on the CPU by default.
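One quick sanity check (my own snippet, not from the repo): if tensorflow-gpu cannot use the GPU, for example because of a CUDA/TF version mismatch, the device list below contains no GPU and TF silently falls back to CPU.

```python
# Check whether tensorflow-gpu 1.14 actually sees a usable GPU; an
# incompatible CUDA install shows up here as an empty GPU device list.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())                         # True only if a GPU is usable
print([d.name for d in device_lib.list_local_devices()])  # e.g. '/device:GPU:0'
```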
Oh, my CUDA 10.1 is not compatible with tensorflow-gpu 1.14.0.
After fixing this issue, I get the following:
2019-10-30 18:26:06.376093: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2019-10-30 18:26:06.376141: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at reduction_ops_common.h:180 : Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node encoder/encoder_layer_12/layer_normalization_24/moments/mean (defined at ../transformer.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[training/clip_by_global_norm/mul_1/_12367]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node encoder/encoder_layer_12/layer_normalization_24/moments/mean (defined at ../transformer.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node encoder/encoder_layer_12/layer_normalization_24/moments/mean:
encoder/encoder_layer_11/add_1 (defined at ../transformer.py:98)
Input Source operations connected to node encoder/encoder_layer_12/layer_normalization_24/moments/mean:
encoder/encoder_layer_11/add_1 (defined at ../transformer.py:98)
My system is Ubuntu 18.04 with a Tesla V100 32GB (about 25GB free) and tensorflow-gpu 1.14.0. I tried batch sizes of 4, 2, and 1.
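For reference, the report_tensor_allocations_upon_oom hint in the trace above maps to a TF 1.x RunOptions flag. A minimal sketch with a plain tf.Session and a toy op (wiring it into the Estimator-based training.py would need a session hook, which I haven't attempted here):

```python
# Minimal, untested sketch of enabling report_tensor_allocations_upon_oom
# so that an OOM failure also dumps the live tensor allocations. The matmul
# below is a stand-in for the real training step.
import tensorflow as tf

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

a = tf.random.normal([1024, 1024])
b = tf.matmul(a, a)

with tf.Session() as sess:
    # Passing the options to sess.run is what activates the extra report.
    sess.run(b, options=run_options)
```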
While I explore this, I noticed a PR that seems to circumvent this issue (https://github.com/salesforce/ctrl/pull/51). I haven't tested this out but it might be a temporary solution.
Yeah, I can confirm I also can't get a 16GB V100 (8 CPUs, 30GB RAM, 100GB SSD) to work with tensorflow-gpu==1.14 on the Moby Dick training example with batch_size = 1 and iterations = 1, using the 256 model (_v0).
Can you recommend another GPU that would be good for training? Happy to try another. To my understanding, NickWalton's fix handles multiple GPUs but doesn't describe which ones?
Fine-tuning does work on the 32 GB GV100.
> Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.
About this (for general info): what tricks are usually applied to make a lower-memory branch like you did? I looked at the diff with master, and it seems you reduced many tensors from float32 to float16. What else would you try?
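One other common trick besides casting to float16 is gradient accumulation: run several micro-batches, sum their gradients, and apply a single optimizer step, trading compute for activation memory. A minimal TF 1.x sketch on a toy model (purely illustrative, not the CTRL training loop):

```python
# Gradient accumulation in plain TF 1.x: accumulate gradients over
# accum_steps micro-batches, then apply one averaged update. The tiny
# linear model is only there to make the snippet self-contained.
import tensorflow as tf

accum_steps = 4
x = tf.placeholder(tf.float32, [None, 8])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.get_variable("w", [8, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

opt = tf.train.GradientDescentOptimizer(0.01)
tvars = tf.trainable_variables()
grads = tf.gradients(loss, tvars)

# One non-trainable buffer per variable holds the running gradient sum.
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
zero_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accum])
accum_op = tf.group(*[a.assign_add(g) for a, g in zip(accum, grads)])
apply_op = opt.apply_gradients(
    [(a / accum_steps, v) for a, v in zip(accum, tvars)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(zero_op)
    for _ in range(accum_steps):
        sess.run(accum_op, {x: [[0.0] * 8], y: [[1.0]]})
    sess.run(apply_op)
```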
> Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.
I get an OOM error on a 32GB V100