
OOM error with new 774M model when running in Colab

Open ghost opened this issue 4 years ago • 77 comments

When running the sess command, I'm getting an OOM error. Not sure if the new large model is too large for Colab?

WARNING: Logging before flag parsing goes to stderr. W0820 16:58:18.137592 140704259733376 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/sample.py:17: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where

ResourceExhaustedError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args) 1355 try: -> 1356 return fn(*args) 1357 except errors.OpError as e:

7 frames ResourceExhaustedError: OOM when allocating tensor with shape[50257,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node model/wte/Initializer/random_normal/RandomStandardNormal}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

ResourceExhaustedError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args) 1368 pass 1369 message = error_interpolation.interpolate(message, self._graph) -> 1370 raise type(e)(node_def, op, message) 1371 1372 def _extend_graph(self):

ResourceExhaustedError: OOM when allocating tensor with shape[50257,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[node model/wte/Initializer/random_normal/RandomStandardNormal (defined at /usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py:185) ]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Original stack trace for 'model/wte/Initializer/random_normal/RandomStandardNormal': File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "/usr/lib/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in app.launch_new_instance() File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance app.start() File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start ioloop.IOLoop.instance().start() File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start handler_func(fd_obj, events) File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper return fn(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events self._handle_recv() File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv self._run_callback(callback, msg) File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback callback(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper return fn(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher return self.dispatch_shell(stream, msg) File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell handler(stream, idents, msg) File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request user_expressions, allow_stdin) File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute res = shell.run_cell(code, store_history=store_history, silent=silent) File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell interactivity=interactivity, compiler=compiler, result=result) File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes if self.run_code(code, result): File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code exec(code_obj, self.user_global_ns, self.user_ns) File "", line 12, in save_every=500 File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/gpt_2.py", line 170, in finetune output = model.model(hparams=hparams, X=context) File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py", line 185, in model initializer=tf.compat.v1.random_normal_initializer(stddev=0.02)) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1496, in get_variable aggregation=aggregation) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1239, in get_variable aggregation=aggregation) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 562, in get_variable aggregation=aggregation) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 514, in _true_getter aggregation=aggregation) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 929, in _get_single_variable 
aggregation=aggregation) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 259, in call return cls._variable_v1_call(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 220, in _variable_v1_call shape=shape) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 198, in previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 2511, in default_variable_creator shape=shape) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 263, in call return super(VariableMetaclass, cls).call(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 1568, in init shape=shape) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 1698, in _init_from_args initial_value(), name="initial_value", dtype=dtype) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 901, in partition_info=partition_info) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py", line 323, in call shape, self.mean, self.stddev, dtype, seed=self.seed) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/random_ops.py", line 79, in random_normal shape_tensor, dtype, seed=seed1, seed2=seed2) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 728, in random_standard_normal seed2=seed2, name=name) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op op_def=op_def) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in init self._traceback = tf_stack.extract_stack()

ghost avatar Aug 20 '19 17:08 ghost

It is likely not possible to finetune 774M. Discussion here: https://news.ycombinator.com/item?id=20749037

I need to run tests to determine how well it works; if it's not possible, I'll add a bespoke assert to prevent finetuning on it.

minimaxir avatar Aug 20 '19 19:08 minimaxir

Would using fp16 help?

sdan avatar Aug 21 '19 07:08 sdan

Maybe it can be run on one of the GPUs listed here: https://cloud.google.com/compute/all-pricing#gpus ? What would be the minimum configuration, and are there places where it would cost less? Is it possible to use this repo: https://github.com/minimaxir/gpt-2-cloud-run ?

saippuakauppias avatar Aug 21 '19 13:08 saippuakauppias

There is no magic switch for FP16 in TensorFlow [yet], and the 16 GB VRAM offered by cloud GPUs still isn't enough.

If there are any workarounds, I would be interested in them.

minimaxir avatar Aug 21 '19 14:08 minimaxir

Do we need to recompile TensorFlow to use FP16? I have some experience with this and can explain how to do it without much difficulty.

saippuakauppias avatar Aug 21 '19 14:08 saippuakauppias

For reference, this was @AdamDanielKing's answer on HackerNews:

> TalkToTransformer.com uses preemptible P4 GPUs on Google Kubernetes Engine. Changing the number of workers and automatically restarting them when they're preempted is easy with Kubernetes.
>
> To provide outputs incrementally rather than waiting for the entire sequence to be generated, I open a websocket to a worker and have it do a few tokens at a time, sending the output back as it goes. GPT-2 tokens can end partway through a multi-byte character, so to make this work you need to send the raw UTF-8 bytes to the browser and then have it concatenate them before decoding the string.
>
> While my workers can batch requests from multiple users, the modest increase in performance is probably not worth the complexity in most cases.

I won't say that I understand everything though.

woctezuma avatar Aug 21 '19 14:08 woctezuma

@woctezuma That comment only explains how to deploy a trained model, which requires much less GPU memory than training because the gradients aren't stored. @minimaxir is probably right that for training you won't fit a full 774M training batch in the K80 GPU that Colab gives you.

@minimaxir You can work around this by training with a smaller batch size but accumulating gradients over several iterations before applying an update to the weights. That achieves a larger effective batch size than can fit in the GPU. This page might be helpful.
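For concreteness, here is a minimal TF 1.x sketch of gradient accumulation, assuming a loss tensor and trainable variable list named loss and train_vars (illustrative names, not gpt-2-simple's actual code):

```python
import tensorflow as tf

# Assumes `loss` (scalar tensor) and `train_vars` (list of variables) already
# exist in the graph, and that every variable receives a gradient.
opt = tf.train.AdamOptimizer(learning_rate=1e-4)
grads_and_vars = opt.compute_gradients(loss, var_list=train_vars)

# One non-trainable accumulator per variable.
accumulators = [tf.Variable(tf.zeros_like(v), trainable=False)
                for _, v in grads_and_vars]

# Add the current micro-batch's gradients into the accumulators.
accumulate_op = tf.group(*[acc.assign_add(g)
                           for acc, (g, _) in zip(accumulators, grads_and_vars)])

n_accum = 8  # effective batch size = n_accum * per-step batch size
# Apply the averaged gradients, then zero the accumulators.
apply_op = opt.apply_gradients(
    [(acc / n_accum, v) for acc, (_, v) in zip(accumulators, grads_and_vars)])
with tf.control_dependencies([apply_op]):
    step_op = tf.group(*[acc.assign(tf.zeros_like(acc)) for acc in accumulators])

# Training loop: run `accumulate_op` n_accum times (feeding a new micro-batch
# each time), then run `step_op` once to update the weights and reset.
```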

AdamDanielKing avatar Aug 21 '19 15:08 AdamDanielKing

The workflow for 345M finetuning uses a batch size of 1 w/ accumulated gradients. That is the workflow 774M should be using now, with apparently no success.

minimaxir avatar Aug 21 '19 15:08 minimaxir

Ah, I see. That's surprising.

I know OpenAI uses gradient checkpointing for some of their other work, so in that case I'd bet they use it in their training code for GPT-2 as well. See https://github.com/cybertronai/gradient-checkpointing. Instead of storing all the layer activations at once, this stores a subset of them and then recomputes them during the backward pass to significantly reduce memory usage. In my experience it's pretty easy to get that library working, and if you do then it should be effective.

Another workaround is to only train with sequences significantly shorter than the maximum of 1024 tokens.
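For reference, a sketch of how that gradient-checkpointing library is typically wired in; the import path assumes the copy vendored inside gpt-2-simple, and loss/train_vars are illustrative names:

```python
import tensorflow as tf
# gpt-2-simple bundles a copy of cybertronai's memory_saving_gradients module;
# this import path is an assumption about where it lives in the package.
from gpt_2_simple.src import memory_saving_gradients

# Drop-in replacement for tf.gradients: activations between checkpointed
# tensors are recomputed during the backward pass instead of being stored.
# 'collection' mode only checkpoints tensors added explicitly via
# tf.add_to_collection('checkpoints', tensor); 'memory'/'speed' pick them heuristically.
opt_grads = memory_saving_gradients.gradients(loss, train_vars,
                                              checkpoints='collection')

opt = tf.train.AdamOptimizer(learning_rate=1e-4)
opt_apply = opt.apply_gradients(zip(opt_grads, train_vars))
```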

AdamDanielKing avatar Aug 21 '19 15:08 AdamDanielKing

Maybe we should step back and try implementing this with a TPU or multiple GPUs?

In your opinion, which option would be preferable so we don't run into this issue again in the future (when the 1558M-parameter model is released)? (I assume Colab may not be enough for that, of course.)

saippuakauppias avatar Aug 21 '19 16:08 saippuakauppias

@AdamDanielKing, this repo took a good chunk of nshepperd's codebase, as @minimaxir has said in the past. This means the repo automatically does gradient checkpointing for anything that is not 117M (see gpt_2.py).

@saippuakauppias I already tried it on all the GPU/RAM/CPU configurations on GCP. Only after 10-15 failed attempts did I realize it was an issue, which prompted me to check HN and find @minimaxir and @AdamDanielKing's discussion.

And @saippuakauppias, someone has already tried a TPU: [Colab using TPU](https://colab.research.google.com/github/shawwn/gpt-2/blob/tpu/Training_GPT_2_Using_TPUs.ipynb). So far I haven't gotten the best results, although I still have to do some data preprocessing to see what the exact issue is. I'm also getting a pretty high loss on it.

@saippuakauppias can you help me recompile TF to only use FP16?

sdan avatar Aug 21 '19 18:08 sdan

@dantuluri Thanks for pointing this out. It looks like the code only uses 1 gradient checkpoint at layer 10:

https://github.com/minimaxir/gpt-2-simple/blob/4c36ea73164cdf0f15b39f02dbefa8eef96f671e/gpt_2_simple/src/model.py#L195-L196

The code is using memory_saving_gradients in 'collection' mode, so it doesn't automatically add any other checkpoints. 774M has 36 layers, so this means the activations of at least 26 layers will be in memory at the same time. I'd suggest adding many more checkpoints or trying the other modes.
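Roughly, that change inside model.py's transformer loop could look like the sketch below (paraphrased from memory rather than copied, so treat the surrounding names as approximate):

```python
# Inside model.model(), paraphrased:
for layer, past in enumerate(pasts):
    h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
    presents.append(present)
    # Previously this was guarded by `if layer == 10:`. Checkpointing every
    # layer's output keeps only the inter-layer activations resident; the much
    # larger attention/MLP intermediates inside each block get recomputed
    # during the backward pass.
    tf.compat.v1.add_to_collection('checkpoints', h)
```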

AdamDanielKing avatar Aug 21 '19 19:08 AdamDanielKing

@dantuluri, I misunderstood the discussion on Hacker News (recompilation is not needed). FP16 is already available in TensorFlow 1.14:

  • https://developer.nvidia.com/automatic-mixed-precision
  • https://medium.com/tensorflow/automatic-mixed-precision-in-tensorflow-for-faster-ai-training-on-nvidia-gpus-6033234b2540
  • https://github.com/tensorflow/tensorflow/blob/v1.14.0/tensorflow/python/training/experimental/mixed_precision.py

Can anyone check if this helps for a Colab or for Cloud Run?

PS: if you suddenly need to recompile TF, then here is the easiest way: https://github.com/yaroslavvb/tensorflow-community-wheels/pull/121/files

saippuakauppias avatar Aug 21 '19 20:08 saippuakauppias

If FP16 is indeed in TensorFlow 1.14 via pip, I'll give it a test.

minimaxir avatar Aug 21 '19 21:08 minimaxir

Looking into the code it seems @minimaxir used https://github.com/cybertronai/gradient-checkpointing for gradient checkpointing.

I used the variations:

  • collection (which appears to be default)
  • speed (ran into the same OOM problems)
  • memory (ran into: 'unable to find bottleneck tensors! please provide checkpoint nodes manually, or use checkpoints="speed"')

Here are the definitions of each variation just for reference:

  • 'collection' (default): This checkpoints all tensors returned by tf.get_collection('checkpoints'). You then need to make sure you add tensors to this collection using tf.add_to_collection('checkpoints', tensor) when you define your model.
  • 'memory': This uses a heuristic to automatically select a set of nodes to checkpoint which achieves our desired O(sqrt(n)) memory usage. The heuristic works by automatically identifying articulation points in the graph, i.e. tensors which split the graph into two disconnected parts when removed, and then checkpointing a suitable number of these tensors. This currently works well for many, but not all, models.
  • 'speed': This option tries to maximize running speed by checkpointing the outputs of all ops that are typically expensive to compute, namely convolutions and matrix multiplies.

I think FP16 is probably the way to go if it works as @minimaxir said.

sdan avatar Aug 21 '19 21:08 sdan

@saippuakauppias do you know how to use FP16? I'm not too familiar with how to start using it.

sdan avatar Aug 21 '19 21:08 sdan

@dantuluri No, but now I'm trying to figure out how to use it.

An example of how to enable FP16: https://colab.research.google.com/github/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/docs/amp/notebook_v1.14/auto_mixed_precision_demo_cifar10.ipynb

You just need to wrap the optimizer in tensorflow.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite and that's it!

Documentation: https://www.tensorflow.org/api_docs/python/tf/train/experimental/enable_mixed_precision_graph_rewrite https://gist.github.com/tlkh/fa20c5bf3c8b48def4501cccff8b3559
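A minimal sketch of that wrapper under TF 1.14; the optimizer, learning rate, and loss here are illustrative, and the rewrite adds dynamic loss scaling by default:

```python
import tensorflow as tf  # 1.14+

opt = tf.train.AdamOptimizer(learning_rate=1e-4)
# Rewrites the graph so eligible ops run in float16 and wraps the optimizer
# with automatic (dynamic) loss scaling.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
train_op = opt.minimize(loss)  # `loss` assumed to be defined elsewhere
```

Note that the automatic rewrite targets GPUs with Tensor Cores (Volta and newer), so it may decline to convert anything on a Colab K80.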

saippuakauppias avatar Aug 21 '19 21:08 saippuakauppias

There should definitely still be more gradient checkpoints. I just ran some tests on a K80 setting accumulate_gradients to 1 and seeing how many samples can fit in a batch without running out of memory.

| Model | Checkpointing just layer 10 | Checkpointing all layer outputs |
| --- | --- | --- |
| 345M | 1 sample fits | 8 samples fit |

The only code change is removing the if layer == 10: line. This makes the large internal activations of each layer (attention layer, MLP layer) get recomputed, with only the skip connections between each layer being stored. Still, the optimal strategy is likely to be a bit different from this.

Unfortunately I'm still struggling to fit 1 sample of 774M into memory mainly because the attn function inside each layer requires a lot of memory.

Edit: By the way, adding more checkpoints doesn't have a performance hit because its only effect is to not deallocate and recompute the checkpointed layer. So you just want to choose the checkpoints in a way that minimizes the peak memory usage.

AdamDanielKing avatar Aug 22 '19 01:08 AdamDanielKing

How does nshepperd's fork deal with this? It seems like he puts gradient checkpointing at all layers (under if args.accumulate_gradients > 1:), though I'm not sure.

sdan avatar Aug 22 '19 02:08 sdan

@dantuluri He also has the if layer == 10: line, so only one checkpoint. When memory_saving_gradients is in 'collection' mode (the default) it only uses checkpoints that you explicitly add.

AdamDanielKing avatar Aug 22 '19 02:08 AdamDanielKing

Interesting. Going through the speed, memory, and collection modes with the if layer == 10 line removed. Will update once done (running on a 16 GB VRAM V100).

sdan avatar Aug 22 '19 02:08 sdan

Update: the speed and memory options don't work. Only collection (the default) works. All you need to do is delete the if layer == 10 line (I tried if layer == 5 and 2, and those still didn't work), as @AdamDanielKing said.

Currently running on V100. Will try on lower VRAM GPUs.

Edit: Can't vouch for the quality of training; I just saw it training and assumed it works. Edit: Running on a P100 works fine.

sdan avatar Aug 22 '19 05:08 sdan

@dantuluri Perfect! This is with 774M, right? How many samples fit if you set accumulate_gradients to 1 and vary batch_size? Can you get more than one?

I think the boundary between not fitting a sample and fitting one is between 12 GB and 16 GB. I wasn't able to get one of the K80s that Google offers (12 GB) to work. So it seems we still can't train for free on Colab.

Edit: One place in particular that seemed helpful to add a checkpoint was at the model's output:

https://github.com/minimaxir/gpt-2-simple/blob/4c36ea73164cdf0f15b39f02dbefa8eef96f671e/gpt_2_simple/gpt_2.py#L170

I suggest experimenting with adding it

    output = model.model(hparams=hparams, X=context)
    tf.compat.v1.add_to_collection('checkpoints', output['logits'])

and seeing if that increases the number of samples you can fit on the GPU. Edit Sept 21: The line above had a bug but should work now. Still not certain that it lowers the overall memory usage but it's worth trying.

Memory peaks around there, and checkpointing seemed to bring the peak usage down while I was playing with the K80.

AdamDanielKing avatar Aug 22 '19 06:08 AdamDanielKing

Updated my code suggestion one last time. ^

AdamDanielKing avatar Aug 22 '19 06:08 AdamDanielKing

At the moment it's training, with accumulate_gradients = 1 and batch_size = 1 (the defaults).

I think my input may be wrong, because it's structured like this:

<|startoftext|>
hello world
more text more text more text
more text more text more text
<|endoftext|>
<|startoftext|>
more text more text more text
more text more text more text
<|endoftext|>

and so on. But I'm getting the start and end tags in my results... like this:

something
something
something
<|endoftext|>
<|startoftext|>
something
something
sometimes weird characters

Because I'm a bit more familiar with nshepperd's code, do you know where he did the checkpointing (layer == 10)? I got better results training 345M using his code.

Otherwise, any help on getting this code to work with my data would be much appreciated.

In regards to GPU memory usage: I'm using a P100, which has 16 GB of VRAM. When training, regardless of model (345M or 774M), it always maxes out to around 99% (except when generating samples, when it drops to around 50%).

In regards to loss: The loss is really low compared to 345M. I'm getting around 1.5 out of the gate, as opposed to the high 2's with 345M.

Quality of results: For the short time I've been training it, it's not a whole lot better than 345M. This will hopefully change. And as said before, the <|endoftext|> <|startoftext|> tags are somewhat annoying when they show up in the middle of the results... not to mention, where can I remove ======== SAMPLE 1 ========? It's always showing up in all my samples. And when the program saves these samples, it doesn't save them as .txt files, just samples-100 with no extension. With nshepperd's code I could easily make these adjustments; I'm not sure where I can make them in this code.

In regards to your suggestion: I haven't tried output = model.model(hparams=hparams, X=context) followed by tf.compat.v1.add_to_collection('checkpoints', output) yet; I'll update when I get the data issues out of the way.

sdan avatar Aug 22 '19 20:08 sdan

Maybe @nshepperd already tried to solve this problem too?

saippuakauppias avatar Aug 23 '19 16:08 saippuakauppias

I've been trying to finetune 774M using nshepperd's fork on a ~~p3.2xlarge~~ EDIT: p2.xlarge (12 GB), and after trying the checkpointing suggestions here I was still running out of memory. But I got it to run (at least with batches of size 1; haven't tried larger) by changing the optimizer from Adam to vanilla SGD, which I assume has a smaller memory footprint because it lacks Adam's moving averages.

It hasn't run long enough for me to really assess whether vanilla SGD is good enough for the finetuning I want, though. I'll also probably have to play with the learning rate a bit relative to what I used with Adam.
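For anyone trying the same swap, a TF 1.x sketch; the learning rates are placeholders, and opt_grads/train_vars are assumed to already exist:

```python
import tensorflow as tf

# Adam keeps two extra float32 slots (first and second moments) per parameter,
# roughly tripling optimizer state; plain SGD keeps none, momentum keeps one.
# opt = tf.train.AdamOptimizer(learning_rate=2e-5)            # original choice
opt = tf.train.GradientDescentOptimizer(learning_rate=1e-3)   # vanilla SGD
# opt = tf.train.MomentumOptimizer(1e-3, momentum=0.9, use_nesterov=True)

train_op = opt.apply_gradients(zip(opt_grads, train_vars))
```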

rfriel avatar Aug 24 '19 18:08 rfriel

@rfriel You mention p3.2xlarge having a 12 GB GPU -- is this a typo? It should have a 16 GB V100. I wasn't able to fit 774M into only 12 GB (a K80).

AdamDanielKing avatar Aug 24 '19 18:08 AdamDanielKing

> @rfriel You mention p3.2xlarge having a 12 GB GPU -- is this a typo? It should have a 16 GB V100. I wasn't able to fit 774M into only 12 GB (a K80).

Whoops! My typo was in the instance name -- I am using a p2.xlarge, which has a K80. It does fit when I use vanilla SGD (tf.train.GradientDescentOptimizer), and in fact I can fit a batch_size of 2 (haven't tried higher). I'm also using the checkpointing recommendations from this thread.

Haven't been able to get an adaptive optimizer like Adam to fit, even with a batch size of 1. I tried Adadelta too. EDIT: MomentumOptimizer fits (with use_nesterov=True although I imagine it works either way).

rfriel avatar Aug 24 '19 19:08 rfriel

FYI, all work on handling the 774M model is currently happening on the 0.6 branch.

nshepperd's fork does implement SGD as an option; I'm open to porting that into the package if it does indeed help solve this problem.

minimaxir avatar Aug 27 '19 03:08 minimaxir