ResourceExhaustedError after several iterations in a grid search
First off, make sure to check your support options.
The preferred way to resolve usage-related matters is through the docs, which are kept up to date with the latest version of Talos.
If you do end up asking for support in a new issue, make sure to follow the below steps carefully.
1) Confirm the below
- [x] I have looked for an answer in the Docs
- [x] My Python version is 3.5 or higher
- [x] I have searched through the Issues for a duplicate
- [x] I've tested that my Keras model works as a stand-alone
2) Include the output of:
talos.__version__ == 0.6.7
3) Explain clearly what you are trying to achieve
I am running a grid search that gives 36 rounds.
After about 4 or 5 rounds, during a model.fit, I suddenly get hit by a ResourceExhaustedError. I think this is very odd given that I am able to complete at least 3 rounds of fitting on the GPU (with a model and batch size that take up pretty much all of the GPU memory), so it seems that there is a small but significant memory leak somewhere. Any ideas what it could be?
My parameter dictionary is:
p = {
    "sigma_noise": [0, 0.01],
    "nb_filters_0": [16, 32, 64],
    "loss_func": ["cat_CE", "tversky_loss", "cat_FL"],
    "arch": ["U-Net"],
    "act": [Swish, ReLU],
}
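For reference, the 36 rounds follow directly from the size of this grid:

# sanity check of the grid size defined by p above:
# 2 (sigma_noise) * 3 (nb_filters_0) * 3 (loss_func) * 1 (arch) * 2 (act) = 36
n_rounds = 1
for values in p.values():
    n_rounds *= len(values)
print(n_rounds)  # -> 36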
And I'm running a U-Net with 34 million trainable parameters (for nb_filters_0 == 64), input dimensions of (208, 208, 3), a batch size of 12, and 400 epochs.
UPDATE: I did a "quick" test where I ran each model for only 50 epochs, and I got a ResourceExhaustedError again in round 4, during the 5th epoch. I think that was actually the exact same spot as before, when each of the 3 previous models had run for 100+ epochs. This tells me that the models are not properly cleaned out of the GPU memory, and on top of that, I might have a memory leak in my generator. @mikkokotila, what do you think?
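(For context: the usual way to release graph and GPU memory between independent Keras fits with the TF1-style backend is to clear the backend session and force garbage collection between rounds, roughly as sketched below. Note that Scan() already exposes a clear_session argument, visible in the signature in the traceback further down, so in principle something like this should happen between rounds anyway; build_model, permutations, x_train and y_train are hypothetical placeholders.)

import gc
from keras import backend as K

for round_params in permutations:        # hypothetical iterable of parameter sets
    model = build_model(round_params)    # hypothetical model builder
    model.fit(x_train, y_train, epochs=50)
    # drop the finished model and its TF graph so the GPU memory can be reclaimed
    del model
    K.clear_session()
    gc.collect()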
Very interesting. Can you post your full trace?
Of course! See below. I also added the output leading up to it, because I think it gives some idea of how the exception occurs.
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
<ipython-input-10-d1427f7c3b24> in <module>
104 params=p,
105 experiment_name="talos/" + date_string,
--> 106 reduction_method='gamify',
107 )
108
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/Scan.py in __init__(self, x, y, params, model, experiment_name, x_val, y_val, val_split, random_method, seed, performance_target, fraction_limit, round_limit, time_limit, boolean_limit, reduction_method, reduction_interval, reduction_window, reduction_threshold, reduction_metric, minimize_loss, disable_progress_bar, print_params, clear_session, save_weights)
194 # start runtime
195 from .scan_run import scan_run
--> 196 scan_run(self)
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_run.py in scan_run(self)
24 # otherwise proceed with next permutation
25 from .scan_round import scan_round
---> 26 self = scan_round(self)
27 self.pbar.update(1)
28
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_round.py in scan_round(self)
17 # fit the model
18 from ..model.ingest_model import ingest_model
---> 19 self.model_history, self.round_model = ingest_model(self)
20 self.round_history.append(self.model_history.history)
21
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/model/ingest_model.py in ingest_model(self)
8 self.x_val,
9 self.y_val,
---> 10 self.round_params)
~/myapps/mmciad/src/mmciad/utils/hyper.py in talos_model(x, y, val_x, val_y, talos_params)
301 class_weight=class_weights,
302 verbose=internal_params["verbose"],
--> 303 callbacks=model_callbacks + opti_callbacks,
304 )
305 return history, model
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your `' + object_name + '` call to the ' +
90 'Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1730 use_multiprocessing=use_multiprocessing,
1731 shuffle=shuffle,
-> 1732 initial_epoch=initial_epoch)
1733
1734 @interfaces.legacy_generator_methods_support
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
218 sample_weight=sample_weight,
219 class_weight=class_weight,
--> 220 reset_metrics=False)
221
222 outs = to_list(outs)
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
1512 ins = x + y + sample_weights
1513 self._make_train_function()
-> 1514 outputs = self.train_function(ins)
1515
1516 if reset_metrics:
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
3290
3291 fetched = self._callable_fn(*array_vals,
-> 3292 run_metadata=self.run_metadata)
3293 self._call_fetch_callbacks(fetched[-len(self._fetches):])
3294 output_structure = nest.pack_sequence_as(
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
1456 ret = tf_session.TF_SessionRunCallable(self._session._session,
1457 self._handle, args,
-> 1458 run_metadata_ptr)
1459 if run_metadata:
1460 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
ResourceExhaustedError: OOM when allocating tensor with shape[16,192,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node training/Adam/gradients/block1_u_conv1/convolution_grad/Conv2DBackpropInput}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
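(For scale: the single tensor named in the OOM message above is already substantial on its own, assuming 4-byte floats:)

# tensor from the OOM message: shape [16, 192, 208, 208], float32
n_elements = 16 * 192 * 208 * 208
print(n_elements * 4 / 1024 ** 3)  # roughly 0.5 GiB for this single tensor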
Have you looked at this SO post?
How much memory does your GPU have?
I just checked out your link, and it does not appear to describe the issue I am having, though at first glance it did look similar. I am running an Nvidia GeForce GTX 1080 Ti with 11 GB of RAM.
Can you do this:
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
...and share the output you get.
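(For anyone hitting this later: with the TF1-style graph backend, report_tensor_allocations_upon_oom is usually wired up roughly as below. The exact plumbing differs between Keras/TensorFlow versions, so treat this as a sketch rather than the definitive recipe; model stands for the Keras model being trained.)

import tensorflow as tf

# ask TensorFlow to dump a list of allocated tensors when an OOM occurs
run_options = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)
run_metadata = tf.compat.v1.RunMetadata()

# in graph mode, standalone Keras forwards extra compile() kwargs to the backend
# train function, which hands them to session.run()
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    options=run_options,
    run_metadata=run_metadata,
)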
I would love to, but that option crashes my Python kernel, so it's not really possible. This is a long-standing Keras bug, I believe.
Yes, it most certainly is an upstream bug in Keras or TensorFlow.
To avoid doubt, can you share your Scan() command?
Also, how about giving Talos 1.0 a shot? It uses a different backend, so you might have better luck.
Sure! I use custom keras.utils.Sequence data generators, so I have two dummy variables for my Scan command, as shown below:
dummy_x = np.empty((1, BATCH_SIZE, 208, 208))
dummy_y = np.empty((1, BATCH_SIZE))

scan_object = ta.Scan(
    x=dummy_x,
    y=dummy_y,
    disable_progress_bar=False,
    print_params=True,
    model=talos_model,
    params=p,
    experiment_name="talos/" + date_string,
    reduction_method='gamify',
)
I will take a look at Talos 1.0 right away!
So, running Talos 1.0 had the same outcome, but with a slightly different error message at the end:
14% |█▌ | 5/36 [1:52:10<11:43:52, 1362.35s/it]
{'act': <class 'keras_contrib.layers.advanced_activations.swish.Swish'>, 'arch': 'U-Net', 'loss_func': 'cat_CE', 'nb_filters_0': 64, 'sigma_noise': 0.01}
tracking <tf.Variable 'block1_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block1_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block5_bottom_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block5_bottom_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block1_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block1_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
Training | | 0% 0/5 [00:00<?, ?it/s]
Epoch 0 |██▌ | [loss: 2.2988, acc: 0.1848, jaccard1_coef: 0.0575] : 25% 113/451 [01:19<03:12, 1.76it/s]
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
<ipython-input-10-050e8f7c8199> in <module>
104 params=p,
105 experiment_name="talos/" + date_string,
--> 106 reduction_method='gamify',
107 )
108
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/Scan.py in __init__(self, x, y, params, model, experiment_name, x_val, y_val, val_split, random_method, seed, performance_target, fraction_limit, round_limit, time_limit, boolean_limit, reduction_method, reduction_interval, reduction_window, reduction_threshold, reduction_metric, minimize_loss, disable_progress_bar, print_params, clear_session, save_weights)
194 # start runtime
195 from .scan_run import scan_run
--> 196 scan_run(self)
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_run.py in scan_run(self)
24 # otherwise proceed with next permutation
25 from .scan_round import scan_round
---> 26 self = scan_round(self)
27 self.pbar.update(1)
28
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_round.py in scan_round(self)
17 # fit the model
18 from ..model.ingest_model import ingest_model
---> 19 self.model_history, self.round_model = ingest_model(self)
20 self.round_history.append(self.model_history.history)
21
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/model/ingest_model.py in ingest_model(self)
8 self.x_val,
9 self.y_val,
---> 10 self.round_params)
~/myapps/mmciad/src/mmciad/utils/hyper.py in talos_model(x, y, val_x, val_y, talos_params)
303 class_weight=class_weights,
304 verbose=internal_params["verbose"],
--> 305 callbacks=model_callbacks + opti_callbacks,
306 )
307 return history, model
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your `' + object_name + '` call to the ' +
90 'Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1730 use_multiprocessing=use_multiprocessing,
1731 shuffle=shuffle,
-> 1732 initial_epoch=initial_epoch)
1733
1734 @interfaces.legacy_generator_methods_support
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
218 sample_weight=sample_weight,
219 class_weight=class_weight,
--> 220 reset_metrics=False)
221
222 outs = to_list(outs)
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
1512 ins = x + y + sample_weights
1513 self._make_train_function()
-> 1514 outputs = self.train_function(ins)
1515
1516 if reset_metrics:
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
3290
3291 fetched = self._callable_fn(*array_vals,
-> 3292 run_metadata=self.run_metadata)
3293 self._call_fetch_callbacks(fetched[-len(self._fetches):])
3294 output_structure = nest.pack_sequence_as(
~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
1456 ret = tf_session.TF_SessionRunCallable(self._session._session,
1457 self._handle, args,
-> 1458 run_metadata_ptr)
1459 if run_metadata:
1460 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[16,64,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node GaussianNoise_preout/cond/add-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[metrics/acc/Identity/_1095]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[16,64,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node GaussianNoise_preout/cond/add-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Can you run the input model in a loop a few times and see if you get the same result? If yes, I suggest posting this directly to TensorFlow.
Do you mean a simple loop like this:
model = create_model(*args, **kwargs)
for _ in range(6):
    model.compile(**kwargs)
    model.fit(train_generator, epochs=10)
Or should I try to add some sort of garbage collection to this?
No, just the simplest possible loop.
Is the above code example simple enough or can it be even simpler?
EDIT: BTW, can you maybe elaborate a bit on what was changed in Talos 1.0? I tried upgrading to tf-2.2.0rc3 because it fixed a memory leak in the fit method related to the Keras Sequence class.
So far, when running a loop like the one I wrote earlier, I am not getting any ResourceExhaustedError. I have almost completed 5 iterations of the loop with 50 epochs per iteration. With Talos, it crashed at the beginning of the fifth iteration.
Okay, so the loop was set to do ten training sessions of 50 epochs, since I knew that 50 epochs was enough to get the ResourceExhaustedError after 5 iterations in the Talos Scan(). Now it has completed all 10 passes of the loop without any errors whatsoever. I assume this rules out it being a TensorFlow bug?
For good measure, I redid the Scan() just to confirm that updating some of the packages did not alter the outcome. Rather than getting the ResourceExhaustedError, my kernel crashed completely (though it may still be due to a ResourceExhaustedError). Any ideas on how to proceed?
I have now tested it on a different machine with a larger GPU (an NVIDIA Quadro RTX 6000 with 24 GB of RAM) and the same thing happens.
To summarize
The bug(?) appears on two systems with the below configuration(s):
- Nvidia GTX 1080 Ti or Nvidia Quadro RTX 6000
- Nvidia driver 418.87.00
- CUDA 10.1
- CuDNN 7.6.5
- Python >=3.7.6
- TensorFlow 1.13, 2.1, and >=2.2.0rc2
- Talos >= 0.6.0
It does not appear to happen if a model is compiled and fitted several times in a simple loop, which seems to rule out this being a TensorFlow problem.
Is it possible for you to share a self-contained Jupyter notebook or Colab, so I can just run it and reproduce the issue?
Also, is create_model identical in both cases?
Yes, create_model is identical. I will try and see if I can make a self-contained notebook. Currently, my solution has been to modify Talos to accept a new boolean parameter allow_resume, which, if True, saves the ParamSpace, the list of keys/metrics, and the various stores to files on disk; in the event of a crash (or interrupt), it reads these files and restores the important parts of the Scan object before executing scan_run(). It might sacrifice some efficiency, but it sure beats never getting to the finish line ;)
BTW, are there any special considerations behind doing method-level imports rather than module/top-level imports?
EDIT: If you want, you can have a look at my fork of Talos and see what I changed. I haven't committed the latest addition yet, but the primary stuff is in place.
Sorry, I totally missed this.
BTW, are there any special considerations behind doing method-level imports rather than module/top-level imports?
Yes. Chunks of code stay self-contained, readability improves, imports only happen when needed, etc.
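(A toy illustration of the difference, not Talos code:)

# module/top-level import: loaded as soon as this module is imported,
# whether or not the functionality is ever used
import json

def to_json(obj):
    return json.dumps(obj)

def summarize(rows):
    # method-level import: only loaded if and when the function actually runs;
    # the function reads as a self-contained chunk of code
    from statistics import mean
    return mean(rows)

print(to_json({"a": 1}))     # -> {"a": 1}
print(summarize([1, 2, 3]))  # -> 2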
How about we implement the above-said feature into v1.1?
We could do that, but I'm not sure my hack is the best way to go at this point. It makes sense to me, but it could be a lot cleaner, I think. Perhaps storing everything in one file rather than having about three different files to read from :) I will be happy to show you the changes I made, though, and you can decide for yourself what you think of it.
You can look at the changes and additions here: https://github.com/bjtho08/talos/tree/1.0.1-dev
Thanks. Do I understand correctly that the feature is simply to:
- allow storing a "restore point" as an option of Scan()
- be able to refer to a file where the "restore point" is stored
Is there anything I'm missing?
Yep, that pretty much sums it up.
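(Roughly, the user-facing side could look like the call below; allow_resume is the flag from the fork described above, and the rest mirrors the Scan() call posted earlier:)

scan_object = ta.Scan(
    x=dummy_x,
    y=dummy_y,
    model=talos_model,
    params=p,
    experiment_name="talos/" + date_string,
    allow_resume=True,  # proposed: write a restore point to disk as the scan progresses
)
# after a crash or interrupt, re-running the same call reads the restore point and
# continues from the remaining part of the ParamSpace instead of starting over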
In the project directory (where the logging CSV file is stored), three additional files are created: a pickle that contains the various stores from each run, a YAML file that lists the remaining instances in the ParamSpace (dumped from self.param_object), and a YAML file containing self._all_keys, self._metric_keys, and self._val_keys.
As I said, it can most likely be done in a cleaner fashion. I just hacked this together in a few days to work around my issue with constant crashing after a few iterations :)
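(A minimal sketch of the persistence side, assuming the attribute names mentioned above and the single-file layout floated earlier; the per-round stores are left out for brevity:)

import pickle
from pathlib import Path

RESTORE_FILE = "restore_point.pkl"  # hypothetical single-file layout

def save_restore_point(scan_object, experiment_dir):
    # persist the parts of the Scan object needed to resume after a crash
    state = {
        "param_object": scan_object.param_object,  # remaining ParamSpace
        "all_keys": scan_object._all_keys,
        "metric_keys": scan_object._metric_keys,
        "val_keys": scan_object._val_keys,
    }
    with open(Path(experiment_dir) / RESTORE_FILE, "wb") as f:
        pickle.dump(state, f)

def load_restore_point(scan_object, experiment_dir):
    # restore the saved state before re-entering scan_run()
    with open(Path(experiment_dir) / RESTORE_FILE, "rb") as f:
        state = pickle.load(f)
    scan_object.param_object = state["param_object"]
    scan_object._all_keys = state["all_keys"]
    scan_object._metric_keys = state["metric_keys"]
    scan_object._val_keys = state["val_keys"]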
Hello, have you found a working solution or workaround for this problem yet? I am currently facing the same issue.
I will try to work on this next week.