gpt-2-simple
ResourceExhaustedError: failed to allocate memory
Hi everybody, I have an issue that I didn't have previously. I always trained GPT-2 with these settings without any problems, but now it doesn't work. My settings:

Colab notebook (I'm using Colab Pro, but I also checked the same notebook with another Google account without Colab Pro: same issue)
Model: 355M
Dataset for training: around 3 MB

But during fine-tuning it shows me the error below. Is it something with the general settings of GPT-2? Thank you!
For larger models, the recommended finetune() parameters are:

use_memory_saving_gradients = True
only_train_transformer_layers = True
accumulate_gradients = 1
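A minimal sketch of how those parameters are passed to finetune(). The dataset filename and step count are placeholders, and this needs a GPU runtime (it also downloads the ~1.4 GB model), so it is only meant as a shape to follow, not something to run as-is:

```python
import gpt_2_simple as gpt2

# Download the 355M model once (cached under models/355M).
gpt2.download_gpt2(model_name="355M")

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="dataset.txt",            # placeholder: your training file
              model_name="355M",
              steps=1000,                       # placeholder step count
              use_memory_saving_gradients=True,
              only_train_transformer_layers=True,
              accumulate_gradients=1)
```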
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt
Loading dataset...
100%|██████████| 1/1 [00:23<00:00, 23.22s/it]
dataset has 3308633 tokens
Training...

ResourceExhaustedError                    Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1374     try:
-> 1375       return fn(*args)
   1376     except errors.OpError as e:

7 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1359       return self._call_tf_sessionrun(options, feed_dict, fetch_list,
-> 1360                                       target_list, run_metadata)
   1361

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1452                                 fetch_list, target_list,
-> 1453                                 run_metadata)
   1454

ResourceExhaustedError: failed to allocate memory
[[{{node model/h18/ln_2/add_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

During handling of the above exception, another exception occurred:

ResourceExhaustedError                    Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/gpt_2_simple/gpt_2.py in finetune(sess, dataset, steps, model_name, model_dir, combine, batch_size, learning_rate, accumulate_gradients, restore_from, run_name, checkpoint_dir, sample_every, sample_length, sample_num, multi_gpu, save_every, print_every, max_checkpoints, use_memory_saving_gradients, only_train_transformer_layers, optimizer, overwrite, reuse)
    338             for _ in range(accumulate_gradients):
    339                 sess.run(
--> 340                     opt_compute, feed_dict={context: sample_batch()})
    341             (v_loss, v_summary) = sess.run((opt_apply, summary_loss))
    342         else:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    966     try:
    967       result = self._run(None, fetches, feed_dict, options_ptr,
--> 968                          run_metadata_ptr)
    969       if run_metadata:
    970         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1189     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1190       results = self._do_run(handle, final_targets, final_fetches,
-> 1191                              feed_dict_tensor, options, run_metadata)
   1192     else:
   1193       results = []

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1367     if handle is None:
   1368       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1369                            run_metadata)
   1370     else:
   1371       return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1392               '\nsession_config.graph_options.rewrite_options.'
   1393               'disable_meta_optimizer = True')
-> 1394       raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
   1395
   1396   def _extend_graph(self):

ResourceExhaustedError: failed to allocate memory
[[node model/h18/ln_2/add_1 (defined at /usr/local/lib/python3.7/dist-packages/gpt_2_simple/src/model.py:67) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

Original stack trace for 'model/h18/ln_2/add_1':
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py", line 16, in
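If you want to follow the hint in the error message and see which tensors were live when the OOM hit, you'd build a RunOptions proto and pass it into the failing sess.run call. A sketch against the TF 1.x graph-mode API (the commented sess.run line mirrors the call in gpt_2_simple/gpt_2.py, so you'd have to patch it there; this won't help in Eager mode, as the message notes):

```python
import tensorflow as tf  # TF 1.x, as used by gpt-2-simple

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Inside gpt_2_simple/gpt_2.py's finetune(), the failing call would become:
# sess.run(opt_compute, feed_dict={context: sample_batch()}, options=run_options)
```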
I'm having the same problem on a P100. I trained the 774M model a couple weeks ago and now I can't even train the 355M model on the same dataset.
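For context on why 355M can OOM even on a 16 GB P100, here is a rough back-of-envelope (my own arithmetic, not from the thread): with plain Adam in float32, the optimizer state roughly quadruples the weight memory before any activations are counted, which is exactly what use_memory_saving_gradients and only_train_transformer_layers try to claw back:

```python
# Rough float32 memory estimate for fine-tuning GPT-2 355M with Adam.
# Assumption: weights + gradients + two Adam moment buffers; activations ignored.
params = 355e6
bytes_per_float = 4

weights_gb = params * bytes_per_float / 1024**3
training_gb = weights_gb * 4  # weights, grads, Adam m and v buffers

print(f"weights alone:             {weights_gb:.2f} GB")
print(f"weights + optimizer state: {training_gb:.2f} GB")
```

That is before activation memory, which grows with batch size and sequence length, so headroom on a 16 GB card is thin.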
As replied in another thread:
I made 355M work on Colab Pro, but used gpt-2-simple==0.7.2. For some reason, even after installing this I was still getting tensorflow>2, so in essence I ran these:
!pip install gpt-2-simple==0.7.2
!pip show tensorflow
!pip install tensorflow==1.15.2
!pip show tensorflow
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
I did this so I could use
use_memory_saving_gradients=True,
only_train_transformer_layers=True
which are not available in tensorflow>2.
I have no idea if this is the right approach though.
Here's the caveat: TensorFlow 1.15.2 no longer has GPU support (Jan 30, 2020) :(
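One quick way to confirm whether the pinned TensorFlow build actually sees the Colab GPU (TF 1.x API; in TF 2 you'd use tf.config.list_physical_devices('GPU') instead — this is a diagnostic sketch, not from the thread):

```python
import tensorflow as tf

print(tf.__version__)
# True only if this build was compiled with CUDA and a GPU device is visible.
print(tf.test.is_gpu_available())
```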