pix2pixHD

CUDA out of memory OR gradient overflow with fp16

Open · limt1 opened this issue 4 years ago · 1 comment

Hi all, I get either a CUDA out-of-memory error or a gradient overflow when I enable the `--fp16` option. Training runs normally without `--fp16`. Any ideas?
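For context, enabling `--fp16` in this repo initializes apex amp with `opt_level='O1'` and wraps the generator's backward pass in `amp.scale_loss` (the line the traceback below points at). A minimal sketch of that pattern, paraphrased rather than copied, with `model`, `loss_G`, `optimizer_G`, and `optimizer_D` standing in for the repo's actual objects:

```python
from apex import amp

# O1 patches torch functions to run in fp16 and uses dynamic loss scaling,
# which matches the "opt_level : O1 ... loss_scale : dynamic" banner below.
model, [optimizer_G, optimizer_D] = amp.initialize(
    model, [optimizer_G, optimizer_D], opt_level='O1')

# Generator update: the loss is scaled up before backward() so that small
# fp16 gradients do not underflow to zero.
optimizer_G.zero_grad()
with amp.scale_loss(loss_G, optimizer_G) as scaled_loss:
    scaled_loss.backward()  # the call that fails in the traceback below
optimizer_G.step()
```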

Here is the gradient overflow log:

```
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
[... the scale keeps halving on every step, never recovering ...]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22
Traceback (most recent call last):
  File "train.py", line 85, in <module>
    with amp.scale_loss(loss_G, optimizer_G) as scaled_loss: scaled_loss.backward()
  File "/home//anaconda3/envs/pix2pixHD/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home//anaconda3/envs/pix2pixHD/lib/python3.7/site-packages/apex/amp/handle.py", line 127, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/home//anaconda3/envs/pix2pixHD/lib/python3.7/site-packages/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered
```

limt1 · Nov 17 '20

Have you solved it?

najingligong1111 · Feb 17 '21