gpt-2-simple

ResourceExhaustedError: failed to allocate memory

Open · Merzmensch opened this issue 3 years ago · 2 comments

Hi everybody, I have an issue that I didn't have previously. I always trained GPT-2 with these settings without any problems, but now it doesn't work. My settings: Colab notebook (I'm using Colab Pro, but I also checked the same notebook with another Google account without Colab Pro: same issue). Model: 355M. Dataset for training: around 3 MB.

But during fine-tuning it shows me the error. Is it something to do with the general settings of GPT-2? Thank you!


For larger models, the recommended finetune() parameters are:

              use_memory_saving_gradients = True
              only_train_transformer_layers = True
              accumulate_gradients = 1
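As a minimal sketch, the parameters above could be bundled and passed to finetune() like this. The helper function, dataset path, and step count here are illustrative, not from the original report; the actual call requires gpt-2-simple on a TF 1.x runtime, so it is shown commented out:

```python
# Hypothetical helper bundling the memory-saving options recommended
# above for larger models (355M and up).
def memory_saving_kwargs(accumulate_gradients=1):
    return {
        "use_memory_saving_gradients": True,
        "only_train_transformer_layers": True,
        "accumulate_gradients": accumulate_gradients,
    }

# Sketch of the fine-tuning call (needs gpt-2-simple and TF 1.x):
# import gpt_2_simple as gpt2
# sess = gpt2.start_tf_sess()
# gpt2.finetune(sess, "dataset.txt", model_name="355M", steps=1000,
#               **memory_saving_kwargs())
```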

Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt
Loading dataset...
100%|██████████| 1/1 [00:23<00:00, 23.22s/it]
dataset has 3308633 tokens
Training...

ResourceExhaustedError                    Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1374     try:
-> 1375       return fn(*args)
   1376     except errors.OpError as e:

7 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1359       return self._call_tf_sessionrun(options, feed_dict, fetch_list,
-> 1360                                       target_list, run_metadata)
   1361

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1452         fetch_list, target_list,
-> 1453         run_metadata)
   1454

ResourceExhaustedError: failed to allocate memory
	 [[{{node model/h18/ln_2/add_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

During handling of the above exception, another exception occurred:

ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input> in <module>()
      9     print_every=10,
     10     sample_every=200,
---> 11     save_every=500
     12 )

/usr/local/lib/python3.7/dist-packages/gpt_2_simple/gpt_2.py in finetune(sess, dataset, steps, model_name, model_dir, combine, batch_size, learning_rate, accumulate_gradients, restore_from, run_name, checkpoint_dir, sample_every, sample_length, sample_num, multi_gpu, save_every, print_every, max_checkpoints, use_memory_saving_gradients, only_train_transformer_layers, optimizer, overwrite, reuse)
    338             for _ in range(accumulate_gradients):
    339                 sess.run(
--> 340                     opt_compute, feed_dict={context: sample_batch()})
    341             (v_loss, v_summary) = sess.run((opt_apply, summary_loss))
    342         else:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    966     try:
    967       result = self._run(None, fetches, feed_dict, options_ptr,
--> 968                          run_metadata_ptr)
    969       if run_metadata:
    970         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1189     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1190       results = self._do_run(handle, final_targets, final_fetches,
-> 1191                              feed_dict_tensor, options, run_metadata)
   1192     else:
   1193       results = []

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1367     if handle is None:
   1368       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1369                            run_metadata)
   1370     else:
   1371       return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1392                 '\nsession_config.graph_options.rewrite_options.'
   1393                 'disable_meta_optimizer = True')
-> 1394       raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
   1395
   1396   def _extend_graph(self):

ResourceExhaustedError: failed to allocate memory
	 [[node model/h18/ln_2/add_1 (defined at /usr/local/lib/python3.7/dist-packages/gpt_2_simple/src/model.py:67) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

Original stack trace for 'model/h18/ln_2/add_1':
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.7/dist-packages/traitlets/config/application.py", line 846, in launch_instance
    app.start()
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
    self._run_once()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 1786, in _run_once
    handle._run()
  File "/usr/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 122, in _handle_events
    handler_func(fileobj, events)
  File "/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 452, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 481, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 431, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.7/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 11, in <module>
    save_every=500
  File "/usr/local/lib/python3.7/dist-packages/gpt_2_simple/gpt_2.py", line 198, in finetune
    output = model.model(hparams=hparams, X=context, gpus=gpus, reuse=reuse)
  File "/usr/local/lib/python3.7/dist-packages/gpt_2_simple/src/model.py", line 203, in model
    h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
  File "/usr/local/lib/python3.7/dist-packages/gpt_2_simple/src/model.py", line 158, in block
    m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
  File "/usr/local/lib/python3.7/dist-packages/gpt_2_simple/src/model.py", line 67, in norm
    x = x*g + b
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1367, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1700, in _add_dispatch
    return gen_math_ops.add_v2(x, y, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 465, in add_v2
    "AddV2", x=x, y=y, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 3569, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

Merzmensch · Oct 23 '21 11:10

I'm having the same problem on a P100. I trained the 774M model a couple of weeks ago, and now I can't even train the 355M model on the same dataset.

keysmashed · Oct 24 '21 01:10

As replied in another thread:

I made 355M work on Colab Pro, but used gpt-2-simple==0.7.2. For some reason, even after installing that, I was still getting tensorflow>=2, so in essence I ran these:

!pip install gpt-2-simple==0.7.2
!pip show tensorflow
!pip install tensorflow==1.15.2
!pip show tensorflow
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

I did this so I could use

              use_memory_saving_gradients=True,
              only_train_transformer_layers=True

These options are not available with tensorflow>=2.
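One way to catch a silently upgraded runtime is to check the major version before fine-tuning. This is a sketch; the helper name is made up and the check is just a string comparison on `tf.__version__`:

```python
# Hypothetical guard: the memory-saving flags rely on TF 1.x
# graph-mode behavior, so verify the runtime before fine-tuning.
def is_tf1(version_string):
    major = int(version_string.split(".")[0])
    return major < 2

# e.g. after `import tensorflow as tf`:
# assert is_tf1(tf.__version__), "pin tensorflow==1.15.2 first"
```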

I have no idea if this is the right approach though.

Here's the caveat: TensorFlow 1.15.2 no longer has GPU support (as of Jan 30, 2020) :(.

dean-dalianis · Nov 11 '21 15:11