gpt-2

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU

Open josai opened this issue 6 years ago • 12 comments

Caused by op 'model/h3/attn/truediv_1', defined at:
  File "train.py", line 293, in <module>
    main()
  File "train.py", line 138, in main
    opt_grads = memory_saving_gradients.gradients(loss, train_vars)
  File "C:\Users\The Atomizer\Desktop\text\gpt2\memory_saving_gradients.py", line 250, in gradients
    copied_sgv, info = ge.copy_with_input_replacements(ge.sgv(ops_to_copy), {})
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 673, in copy_with_input_replacements
    sgv, dst_graph, dst_scope, src_scope, reuse_dst_scope=reuse_dst_scope)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 453, in __call__
    self.copy_ops(info)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 467, in copy_ops
    op, op_outputs = self.transform_op_handler(info, op, new_inputs)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 177, in copy_op_handler
    [], input_types_, None, op_def_)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node model/h3/attn/truediv_1 (defined at C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py:177) = RealDiv[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/h3/attn/Exp_1, model/h3/attn/Sum_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

josai avatar Jun 07 '19 21:06 josai
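For anyone hitting this: the hint at the end of the error can be acted on directly. A minimal TF 1.x sketch of passing report_tensor_allocations_upon_oom; the toy graph here just stands in for the training step that train.py actually builds:

```python
import tensorflow as tf

# Ask TF 1.x to list live tensor allocations if an OOM occurs, so the error
# message names the ops actually holding GPU memory.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Toy graph only; in train.py the same options= argument would be added to
# the training step's sess.run call.
x = tf.random_normal([1024, 1024])
y = tf.matmul(x, x)

with tf.Session() as sess:
    print(sess.run(y, options=run_options).shape)
```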

What's going on here? Am I out of memory? I can't get it to train.

josai avatar Jun 07 '19 21:06 josai

conda install -c anaconda cudnn==6.0.0 --yes

seemed to fix the problem...

josai avatar Jun 07 '19 22:06 josai

@josai how did you know that you needed to do conda install -c anaconda cudnn==6.0.0 --yes from the error message?

ghost avatar Jun 10 '19 17:06 ghost

@josai how did you know that you needed to do conda install -c anaconda cudnn==6.0.0 --yes from the error message?

Googling similar errors and trying their solutions until one worked.

josai avatar Jun 11 '19 06:06 josai

Is this with the 345M model? I've found it only just fits on a 1080 Ti, so anything using substantial VRAM, like a browser running in the background, can push it over the edge.

nshepperd avatar Jun 11 '19 18:06 nshepperd

Is this with the 345M model? I've found it only just fits on a 1080 Ti, so anything using substantial VRAM, like a browser running in the background, can push it over the edge.

No, neither model was working until I conda-installed cuDNN. I am currently retraining the 345M model on a GTX 970 with several applications, including Chrome, running in the background with no problems.

josai avatar Jun 11 '19 18:06 josai

I have a GTX 1060 6GB and I also have this problem. Searching on Google, I read that the batch size should be reduced, so I launched PYTHONPATH=src ./train.py --batch_size 1 --dataset test.txt, but I had the same problem. I then changed this line in train.py: return [data_sampler.sample(1024) for _ in range(args.batch_size)] to return [data_sampler.sample(512) for _ in range(args.batch_size)], but I don't know what this line does. How will the training change? If this change is not good, how can I fix it?

iacoposk8 avatar Jul 27 '19 19:07 iacoposk8
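For reference, the line being edited here is the batch sampler in train.py. A minimal sketch of that change, with the hard-coded chunk length pulled into a named constant (TRAIN_CHUNK_LEN is just an illustrative name, not an existing flag; data_sampler and args come from train.py itself):

```python
# Each training example is a random chunk of the encoded dataset; the number
# passed to data_sampler.sample() is the chunk length in tokens. Shrinking it
# shortens the sequences seen during training, which cuts the quadratic
# attention memory, at the cost of never training on full 1024-token contexts.
TRAIN_CHUNK_LEN = 512  # was 1024

def sample_batch():
    return [data_sampler.sample(TRAIN_CHUNK_LEN) for _ in range(args.batch_size)]
```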

Same problem. Are there any diagnostics we can run?

dji-transpire avatar Aug 04 '19 13:08 dji-transpire
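On the diagnostics question: one quick first check, assuming a TF 1.x install, is whether TensorFlow can see the GPU at all and how much memory its allocator was given, since a broken CUDA/cuDNN setup and a genuine out-of-memory condition can produce similar-looking failures:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# True only if TF was built with CUDA support and can actually use a GPU.
print("GPU available:", tf.test.is_gpu_available())

# Lists every device TF registered, including its memory_limit in bytes;
# no GPU entry here points at the CUDA/cuDNN install rather than at OOM.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type, dev.memory_limit)
```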

iacoposk8, that change is one way to reduce the memory usage. You are basically shortening the model's memory there, allowing it to remember only the last 512 tokens instead of the full 1024 during training. I'm not sure how much of an effect that would have on output quality.

nshepperd avatar Aug 15 '19 00:08 nshepperd
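To put rough numbers on that (a back-of-the-envelope estimate, not a profile of the actual graph): the tensor in the OOM message above is one attention map of shape [batch, heads, seq, seq] in float32, several of which are kept alive per layer for the backward pass, so halving the sequence length shrinks each one by a factor of four.

```python
# Size of one attention map like the [1, 12, 1024, 1024] float tensor in the
# error, in MiB (4 bytes per float32 element).
def attn_map_mib(seq_len, heads=12, batch=1):
    return batch * heads * seq_len * seq_len * 4 / 2**20

print(attn_map_mib(1024))  # 48.0 MiB per map at the full 1024-token context
print(attn_map_mib(512))   # 12.0 MiB at 512 tokens, a 4x reduction
```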

Thanks for the answer. So can I use, for example, 895 as the value, or is a number like 128, 512, 1024, etc. better?

Another question: I am training a model for my language; in your opinion, how low should the loss be to get a good model without overfitting?

Last question: how can I generate texts about a certain topic?

Thank you

iacoposk8 avatar Aug 15 '19 14:08 iacoposk8

I'm having the same problem trying to train the 355M model on an RTX 2070 8GB. Even with both --memory_saving_gradients and --optimizer sgd I get the following error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,16,1024,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node model/h23/attn/MatMul_1_1}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/h23/attn/truediv_1, model/h23/attn/transpose_2_1)]]

I didn't use conda, but I have cuDNN installed manually (cuDNN v7.6.2.24 on CUDA 9.0).

ProtoxiDe22 avatar Aug 27 '19 13:08 ProtoxiDe22
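One more knob that sometimes matters when other programs are also using the card (a hedged sketch, not a fix for a model that genuinely does not fit in 8 GB): TF 1.x reserves nearly all GPU memory at startup by default, and letting it grow on demand instead can avoid failing just because a browser or the desktop already holds a slice of VRAM.

```python
import tensorflow as tf

# Let the BFC allocator claim GPU memory incrementally instead of grabbing
# almost everything when the session starts. Helps when background apps hold
# part of the card; does nothing if the model itself needs more memory than
# the GPU has.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # in train.py, the same config would be passed to its tf.Session
```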

I have a GTX 1060 6GB and I also have this problem. Searching on Google, I read that the batch size should be reduced, so I launched PYTHONPATH=src ./train.py --batch_size 1 --dataset test.txt, but I had the same problem. I then changed this line in train.py: return [data_sampler.sample(1024) for _ in range(args.batch_size)] to return [data_sampler.sample(512) for _ in range(args.batch_size)], but I don't know what this line does. How will the training change? If this change is not good, how can I fix it?

This fix worked for me.

schematical avatar Jun 11 '20 19:06 schematical