gpt-2
774M Model running out of memory
2019-08-20 22:56:50.301264: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8e6a00 next 222 of size 256
2019-08-20 22:56:50.301278: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8e6b00 next 224 of size 5120
2019-08-20 22:56:50.301307: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8e7f00 next 225 of size 256
2019-08-20 22:56:50.301330: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8e8000 next 227 of size 20480
2019-08-20 22:56:50.301339: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8ed000 next 228 of size 5120
2019-08-20 22:56:50.301347: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8ee400 next 229 of size 5120
2019-08-20 22:56:50.301355: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8ef800 next 230 of size 5120
2019-08-20 22:56:50.301381: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f0c00 next 231 of size 5120
2019-08-20 22:56:50.301387: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f2000 next 232 of size 5120
2019-08-20 22:56:50.301399: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f3400 next 233 of size 5120
2019-08-20 22:56:50.301408: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f4800 next 234 of size 256
2019-08-20 22:56:50.301416: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f4900 next 236 of size 5120
2019-08-20 22:56:50.301425: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f5d00 next 237 of size 256
2019-08-20 22:56:50.301433: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f5e00 next 238 of size 15360
2019-08-20 22:56:50.301442: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f9a00 next 239 of size 256
2019-08-20 22:56:50.301450: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8f9b00 next 242 of size 5120
2019-08-20 22:56:50.301459: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8faf00 next 243 of size 256
2019-08-20 22:56:50.312661: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7f4cfe8fb000 next 18446744073709551615 of size 20480
2019-08-20 22:56:50.312681: I tensorflow/core/common_runtime/bfc_allocator.cc:809] Summary of in-use Chunks by size:
2019-08-20 22:56:50.312699: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 411 Chunks of size 256 totalling 102.8KiB
2019-08-20 22:56:50.312710: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 1280 totalling 1.2KiB
2019-08-20 22:56:50.312720: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 16 Chunks of size 4096 totalling 64.0KiB
2019-08-20 22:56:50.312732: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 940 Chunks of size 5120 totalling 4.59MiB
2019-08-20 22:56:50.312741: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 144 Chunks of size 15360 totalling 2.11MiB
2019-08-20 22:56:50.312750: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 144 Chunks of size 20480 totalling 2.81MiB
2019-08-20 22:56:50.312760: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 8 Chunks of size 81920 totalling 640.0KiB
2019-08-20 22:56:50.312770: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 8 Chunks of size 4194304 totalling 32.00MiB
2019-08-20 22:56:50.312779: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 79 Chunks of size 5242880 totalling 395.00MiB
2019-08-20 22:56:50.312789: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 5246976 totalling 5.00MiB
2019-08-20 22:56:50.312798: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 143 Chunks of size 6553600 totalling 893.75MiB
2019-08-20 22:56:50.312808: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 9134592 totalling 8.71MiB
2019-08-20 22:56:50.312821: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 141 Chunks of size 19660800 totalling 2.58GiB
2019-08-20 22:56:50.312831: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 28 Chunks of size 20971520 totalling 560.00MiB
2019-08-20 22:56:50.312841: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 22806528 totalling 21.75MiB
2019-08-20 22:56:50.312850: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 283 Chunks of size 26214400 totalling 6.91GiB
2019-08-20 22:56:50.312860: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 32243712 totalling 30.75MiB
2019-08-20 22:56:50.312869: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 32505856 totalling 31.00MiB
2019-08-20 22:56:50.312879: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 2 Chunks of size 33554432 totalling 64.00MiB
2019-08-20 22:56:50.312888: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 7 Chunks of size 37748736 totalling 252.00MiB
2019-08-20 22:56:50.312898: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 38061056 totalling 36.30MiB
2019-08-20 22:56:50.312909: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 40894464 totalling 39.00MiB
2019-08-20 22:56:50.312918: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 50394368 totalling 48.06MiB
2019-08-20 22:56:50.312927: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 28 Chunks of size 83886080 totalling 2.19GiB
2019-08-20 22:56:50.312937: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 141774848 totalling 135.21MiB
2019-08-20 22:56:50.312947: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 142606336 totalling 136.00MiB
2019-08-20 22:56:50.312956: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 257315840 totalling 245.40MiB
2019-08-20 22:56:50.312966: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 14.55GiB
2019-08-20 22:56:50.312975: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15652398592 memory_limit_: 15652398695 available bytes: 103 curr_region_allocation_bytes_: 17179869184
2019-08-20 22:56:50.312997: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit: 15652398695
InUse: 15626903296
MaxInUse: 15647874816
NumAllocs: 4685
MaxAllocSize: 257315840
2019-08-20 22:56:50.313127: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2019-08-20 22:56:50.313180: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[1,20,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,20,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node gradients/model/h18/attn/truediv_1_grad/Neg}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./train.py", line 291, in <module>
main()
File "./train.py", line 269, in main
feed_dict={context: sample_batch()})
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,20,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node gradients/model/h18/attn/truediv_1_grad/Neg (defined at /home/surya/gpt-2/src/memory_saving_gradients.py:216) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Errors may have originated from an input operation.
Input Source operations connected to node gradients/model/h18/attn/truediv_1_grad/Neg:
model/h18/attn/Exp_1 (defined at /home/surya/gpt-2/src/memory_saving_gradients.py:204)
Original stack trace for 'gradients/model/h18/attn/truediv_1_grad/Neg':
File "./train.py", line 291, in <module>
main()
File "./train.py", line 138, in main
opt_grads = memory_saving_gradients.gradients(loss, train_vars)
File "/home/surya/gpt-2/src/memory_saving_gradients.py", line 216, in gradients
dv = tf_gradients(ys=copied_ys, xs=boundary+xs, grad_ys=grad_ys, **kwargs)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 158, in gradients
unconnected_gradients)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 731, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 403, in _MaybeCompile
return grad_fn() # Exit early
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 731, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/math_grad.py", line 1147, in _RealDivGrad
grad * math_ops.realdiv(math_ops.realdiv(-x, y), y), ry),
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 6633, in neg
"Neg", x=x, name=name)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
...which was originally created as op 'model/h18/attn/truediv_1', defined at:
File "./train.py", line 291, in <module>
main()
[elided 0 identical lines from previous traceback]
File "./train.py", line 138, in main
opt_grads = memory_saving_gradients.gradients(loss, train_vars)
File "/home/surya/gpt-2/src/memory_saving_gradients.py", line 204, in gradients
copied_sgv, info = ge.copy_with_input_replacements(ge.sgv(ops_to_copy), {})
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/contrib/graph_editor/transform.py", line 672, in copy_with_input_replacements
sgv, dst_graph, dst_scope, src_scope, reuse_dst_scope=reuse_dst_scope)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/contrib/graph_editor/transform.py", line 452, in __call__
self._copy_ops(info)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/contrib/graph_editor/transform.py", line 466, in _copy_ops
op_, op_outputs_ = self.transform_op_handler(info, op, new_inputs)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/contrib/graph_editor/transform.py", line 176, in copy_op_handler
[], input_types_, None, op_def_)
File "/root/miniconda3/envs/tft/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
Running on a personal machine with GPUs and everything installed. The 345M model trained fine, but I'm running into memory issues with 774M.
I made sure memory-saving gradients were enabled and the batch size was just 1. Any suggestions?
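In case it helps anyone reproduce the allocator dump above: the `report_tensor_allocations_upon_oom` hint from the log can be turned on by passing `RunOptions` to `sess.run`. A minimal, self-contained TF 1.x sketch (the tiny placeholder graph is just so the snippet runs on its own; in train.py the options would go on the existing `sess.run` call that OOMs):

```python
import tensorflow as tf  # TF 1.x, matching the traceback above

# Ask the BFC allocator to dump the list of live tensors when an OOM occurs,
# as suggested by the hint in the error message.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Tiny placeholder graph so this snippet is runnable by itself.
x = tf.placeholder(tf.float32, shape=[None, 1024])
y = tf.reduce_sum(tf.square(x))

with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: [[0.0] * 1024]}, options=run_options))
```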
Same here :+1:
We are looking for a solution here: https://github.com/minimaxir/gpt-2-simple/issues/108
I was able to get it working on a Tesla P40 GPU (24GB). It still hit OOM with Adam, but switching the optimizer to SGD made it work.
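If it helps, the swap is just a matter of building a different optimizer over the same loss. A rough TF 1.x sketch of what I mean (the dummy loss and the learning rate are illustrative, not the repo's actual values):

```python
import tensorflow as tf  # TF 1.x

# Dummy stand-in for the GPT-2 language-model loss built in train.py.
w = tf.Variable(tf.random_normal([1024, 1024]))
loss = tf.reduce_mean(tf.square(w))
train_vars = tf.trainable_variables()

# Adam keeps two extra moment tensors per trainable parameter, so for a
# 774M-parameter model the optimizer state alone is several GB.
# opt = tf.train.AdamOptimizer(learning_rate=1e-4)

# Plain SGD keeps no per-parameter state, so only the weights, activations
# and gradients have to fit on the GPU.
opt = tf.train.GradientDescentOptimizer(learning_rate=1e-4)
train_op = opt.minimize(loss, var_list=train_vars)
```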
I think there was another person who said they got it working on a 24GB GPU. The unfortunate part is that gcloud only offers the V100 at most, and that's all I and many others have access to at the moment. That's why I'm trying to find new optimizers or ways to distribute the memory (although historically that has been hard for fully connected transformers). More here: https://github.com/dantuluri/gpt-2
Yep, it's really hard to find anything larger than 16GB, but Azure does offer a 24GB instance (ND6s, in case someone wants to try it out).
Works fine with SGD on a Titan RTX (24GB).
@mgrankin is there some kind of cloud provider that provides RTX?
@saippuakauppias I don't know of any. Nvidia discourages the use of RTX cards in datacenters.
Just FYI, I have been able to fine-tune a subset of variables under Colab with good results. My forked notebook (via Tenoke's fork) demonstrating how to do this is here: https://github.com/jkraybill/gpt-2/blob/finetuning/GPT2-finetuning2-774M.ipynb
I got the tip from an under-the-radar tweet by BasedBlue. (See the comments here: https://github.com/minimaxir/gpt-2-simple/issues/108#issuecomment-533842411)
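The gist of it is to hand the optimizer only a subset of the model's variables, so gradients (and any optimizer slots) are only allocated for those layers. A rough sketch of the filtering step, assuming the `model/h<N>/` per-block variable naming used by the GPT-2 graph (which blocks to train is an illustrative choice here, not the notebook's exact list):

```python
import tensorflow as tf  # TF 1.x

all_vars = tf.trainable_variables()

# 774M has 36 transformer blocks (model/h0 ... model/h35); only fine-tune
# the last few blocks plus the final layer norm.
train_prefixes = ('model/h33/', 'model/h34/', 'model/h35/', 'model/ln_f/')
train_vars = [v for v in all_vars if v.name.startswith(train_prefixes)]

# Gradients (and optimizer state) then only get created for train_vars,
# e.g. opt_grads = memory_saving_gradients.gradients(loss, train_vars)
```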
@saippuakauppias FWIW, I just learned today that Linode is running a paid pilot with 24GB Quadro RTX cards. I haven't used them personally.
@jkraybill Thanks! But it's not clear whether they can be rented on an hourly basis; for a whole month it's insanely expensive...
I started fine-tuning the model on a 48GB GPU today. With the Adam optimizer it uses 33GB of memory, and I have even been able to increase the batch size thanks to the GPU's extra capacity.
@nkk0, the Adafactor optimizer + checkpointing uses ~8GB of GPU RAM with a batch size of 1. Read all my comments in issue 108 in the minimaxir repo (link above).
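For anyone who doesn't want to read the whole thread: Adafactor keeps factored second-moment statistics instead of Adam's full per-parameter moments, which is where most of the memory saving comes from. A minimal sketch using the AdafactorOptimizer from the tensor2tensor package (an assumption on my part; any TF 1.x Adafactor implementation should slot in the same way, and the dummy loss and learning rate are illustrative):

```python
import tensorflow as tf  # TF 1.x
from tensor2tensor.utils.adafactor import AdafactorOptimizer

# Stand-in loss; in train.py this would be the LM cross-entropy.
w = tf.Variable(tf.random_normal([1024, 1024]))
loss = tf.reduce_mean(tf.square(w))

# Adafactor factors the second-moment accumulator over rows and columns, so
# its state per weight matrix is O(rows + cols) rather than O(rows * cols).
opt = AdafactorOptimizer(learning_rate=1e-3)
train_op = opt.minimize(loss)
```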