open-solution-mapping-challenge
CUDA Memory Errors at first epoch at default batch size
Good day, I would just like to ask if you have any idea why I am running into CUDA out-of-memory errors during training. This happens at the end of the first epoch (epoch 0). For reference, I am just trying to reproduce the results in REPRODUCE_RESULTS.md with the smaller dataset (annotation-small.json).
My configuration is:
OS: Windows 10 (Anaconda Prompt)
GPU: GeForce GTX 1070 Ti (single)
torch version: 1.0.1
The error stack is as follows:
2019-03-22 14-23-05 steps >>> epoch 0 average batch time: 0:00:00.7
2019-03-22 14-23-06 steps >>> epoch 0 batch 411 sum: 1.74406
2019-03-22 14-23-07 steps >>> epoch 0 batch 412 sum: 2.26457
2019-03-22 14-23-07 steps >>> epoch 0 batch 413 sum: 1.95351
2019-03-22 14-23-08 steps >>> epoch 0 batch 414 sum: 2.39538
2019-03-22 14-23-09 steps >>> epoch 0 batch 415 sum: 1.83759
2019-03-22 14-23-10 steps >>> epoch 0 batch 416 sum: 1.92264
2019-03-22 14-23-10 steps >>> epoch 0 batch 417 sum: 1.71246
2019-03-22 14-23-11 steps >>> epoch 0 batch 418 sum: 2.32141
2019-03-22 14-23-11 steps >>> epoch 0 sum: 2.18943
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
(the three lines above are repeated eight times in the log)
B:\ML Models\src\callbacks.py:168: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
X = Variable(X, volatile=True).cuda()
Traceback (most recent call last):
File "main.py", line 93, in <module>
main()
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 697, in main
rv = self.invoke(ctx)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 535, in invoke
return callback(*args, **kwargs)
File "main.py", line 31, in train
pipeline_manager.train(pipeline_name, dev_mode)
File "B:\ML Models\src\pipeline_manager.py", line 32, in train
train(pipeline_name, dev_mode, self.logger, self.params, self.seed)
File "B:\ML Models\src\pipeline_manager.py", line 116, in train
pipeline.fit_transform(data)
File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
step_inputs[input_step.name] = input_step.fit_transform(data)
File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
step_inputs[input_step.name] = input_step.fit_transform(data)
File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
step_inputs[input_step.name] = input_step.fit_transform(data)
[Previous line repeated 4 more times]
File "B:\ML Models\src\steps\base.py", line 112, in fit_transform
return self._cached_fit_transform(step_inputs)
File "B:\ML Models\src\steps\base.py", line 123, in _cached_fit_transform
step_output_data = self.transformer.fit_transform(**step_inputs)
File "B:\ML Models\src\steps\base.py", line 262, in fit_transform
self.fit(*args, **kwargs)
File "B:\ML Models\src\models.py", line 82, in fit
self.callbacks.on_epoch_end()
File "B:\ML Models\src\steps\pytorch\callbacks.py", line 92, in on_epoch_end
callback.on_epoch_end(*args, **kwargs)
File "B:\ML Models\src\steps\pytorch\callbacks.py", line 163, in on_epoch_end
val_loss = self.get_validation_loss()
File "B:\ML Models\src\callbacks.py", line 132, in get_validation_loss
return self._get_validation_loss()
File "B:\ML Models\src\callbacks.py", line 138, in _get_validation_loss
outputs = self._transform()
File "B:\ML Models\src\callbacks.py", line 172, in _transform
outputs_batch = self.model(X)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\parallel\data_parallel.py", line 141, in forward
return self.module(*inputs[0], **kwargs[0])
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "B:\ML Models\src\unet_models.py", line 387, in forward
conv2 = self.conv2(conv1)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\container.py", line 92, in forward
input = module(input)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torchvision\models\resnet.py", line 88, in forward
out = self.bn3(out)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\batchnorm.py", line 76, in forward
exponential_average_factor, self.eps)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\functional.py", line 1623, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 8.00 GiB total capacity; 6.18 GiB already allocated; 56.00 MiB free; 48.95 MiB cached)
Lowering the batch size from the default 20 to 10 dropped GPU memory usage during training from ~6 GB to ~4 GB, but at the end of epoch 0 usage climbed back up to ~6 GB, and subsequent epochs have continued training at ~6 GB.
Is this behavior expected/normal? I read somewhere that you also used GTX 1070 GPUs for training, so I thought I would be able to train at the default batch size. Also, is it normal for GPU memory usage to increase between epochs 0 and 1? Thank you!
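One detail from the traceback that may matter: the memory jump at the end of epoch 0 coincides with the validation pass in callbacks.py, and on torch 1.0.1 the `volatile=True` flag mentioned in the UserWarning is a no-op, so the validation forward pass builds an autograd graph on top of the training allocation. Below is a minimal sketch of a gradient-free validation loop; it assumes `self.datagen` yields (X, y) batches and that `self.model` is the DataParallel-wrapped U-Net from the traceback, so the exact names may differ from the real `_transform` in callbacks.py:

```python
import torch

def _transform(self):
    """Validation forward pass without autograd buffers (sketch only).

    Assumes `self.datagen` yields (X, y) batches and `self.model` is the
    DataParallel-wrapped U-Net seen in the traceback; adjust names as needed.
    """
    self.model.eval()                 # put BatchNorm/Dropout into eval mode
    outputs = []
    with torch.no_grad():             # replaces the removed volatile=True flag
        for X, _ in self.datagen:
            X = X.cuda()
            outputs_batch = self.model(X)
            outputs.append(outputs_batch.cpu())  # keep results off the GPU
    self.model.train()
    return outputs
```

With `no_grad()` the intermediate activations are freed after each batch, so validation should not add several gigabytes on top of what training already holds.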
Hi,
I have the same issue. After the first epoch I get: RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 8.00 GiB total capacity; 6.43 GiB already allocated; 0 bytes free; 6.53 GiB reserved in total by PyTorch)
I am running the mapping challenge dataset.
I have experimented with varying batch sizes and also the number of workers, but the problem occurs regardless of the settings.
Update: significantly reducing the batch size (from 20 to 8) has solved the issue for me.
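If anyone wants to confirm whether the extra gigabytes really come from the validation pass rather than a leak, PyTorch's memory counters can be logged around the epoch boundary. A small hypothetical helper (the tags and call sites are just for illustration):

```python
import torch

def log_gpu_memory(tag, device=0):
    """Print current, cached and peak GPU memory in MiB for a quick check."""
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated(device) / mib
    cached = torch.cuda.memory_cached(device) / mib   # "reserved" in newer PyTorch
    peak = torch.cuda.max_memory_allocated(device) / mib
    print('{}: allocated={:.0f} MiB, cached={:.0f} MiB, peak={:.0f} MiB'
          .format(tag, allocated, cached, peak))

# For example, call it once after the training loop of epoch 0 and once
# after the validation loop to see which step grabs the extra memory:
# log_gpu_memory('after train epoch 0')
# log_gpu_memory('after validation epoch 0')
```

Comparing the two readings should show whether the ~2 GB jump happens inside validation or persists into epoch 1.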