
Save per-execution results (intermediate results within a trial) when executions_per_trial > 1


Is your feature request related to a problem? Please describe.

Because of TensorFlow's (or Keras's?) undiagnosed and unpatched memory leak during training/predict (see tensorflow/tensorflow#36465), keras-tuner users running on a GPU may hit an Out Of Memory (OOM) error during a tuner search even though the largest model in the search space would individually fit in GPU memory. The effect of this leak can be mitigated by re-running the tuner search after each OOM crash: the tuner picks up at the last saved trial. However, from what I've seen in the keras-tuner code, the tuner resumes from the beginning of that trial even if some executions had already completed without error (e.g. with executions_per_trial=3, if the model crashes on the second execution of a trial, a re-instantiated tuner starts over from execution 1). At best this wastes computational resources (especially for large, hard-to-fit models); at worst it can leave the tuner permanently stuck on a trial because of the number of executions it must complete in a single streak. For example, if one tunes batch_size to use the entire GPU memory for the biggest model, which is desirable to leverage the full computational power of the device, the tuner will crash with an OOM whenever executions_per_trial > 1, and restarting from the first execution of the trial will always lead to another OOM, unless a suboptimal batch_size is used.
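For illustration, a minimal sketch of this crash-and-resume loop at the process level (tune.py is a hypothetical script that builds the tuner with overwrite=False and calls tuner.search(); it is not part of keras-tuner):

```python
import subprocess
import sys

# Re-launch the tuning script in a fresh process after every crash so
# TensorFlow's leaked GPU memory is released on process exit, letting
# keras-tuner resume from the last saved trial on the next run.
while True:
    result = subprocess.run([sys.executable, "tune.py"])
    if result.returncode == 0:
        break  # search() returned normally; tuning is complete
    # A non-zero exit (e.g. an uncaught OOM) means TensorFlow died:
    # loop to restart with freed GPU memory.
```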

Describe the solution you'd like

Save the data needed to resume from the failed execution. Since the objective is simply averaged over executions to select the best hyperparameters, keras-tuner could save an executionX.json per execution, or simply add two new fields to trial.json: 'executions_so_far' and 'average_objective_value_so_far'. The tuner would then be able to resume after the last complete execution and reuse the already computed results.
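As a rough illustration of the proposed behavior (this is not an existing keras-tuner API; the ResumableTuner name, the state-file location, and the val_loss objective are all assumptions), per-execution objective values could be checkpointed in an overridden run_trial:

```python
import json
import os

import keras_tuner as kt


class ResumableTuner(kt.RandomSearch):
    """Sketch: persist each completed execution's objective value so a
    restarted tuner resumes mid-trial instead of redoing finished
    executions."""

    def run_trial(self, trial, *args, **kwargs):
        # Keep per-execution state in a file of our own rather than
        # touching keras-tuner's internal trial.json layout.
        os.makedirs("execution_state", exist_ok=True)
        state_path = os.path.join(
            "execution_state", f"trial_{trial.trial_id}.json")
        done = []  # objective values of executions completed so far
        if os.path.exists(state_path):
            with open(state_path) as f:
                done = json.load(f)
        for _ in range(len(done), self.executions_per_trial):
            hp = trial.hyperparameters
            model = self.hypermodel.build(hp)
            history = self.hypermodel.fit(hp, model, *args, **kwargs)
            # Assumes the objective is "val_loss" and validation data
            # was passed to search().
            done.append(min(history.history["val_loss"]))
            with open(state_path, "w") as f:
                json.dump(done, f)  # checkpoint after every execution
        # Average over executions, as keras-tuner does internally.
        return sum(done) / len(done)
```

With something like this, a re-instantiated tuner that lands on a partially completed trial would redo only the missing executions.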

Describe alternatives you've considered

Fix the entire problem by calling the build_and_fit method in an isolated child process, using multiprocessing with a dedicated TensorFlow initialization. I have tried such a fix via monkey patching and failed, because Keras needs to import TensorFlow in the parent process; see tensorflow/tensorflow#8220. One way to mitigate this would be to start the new process with 'spawn' instead of 'fork' to force TensorFlow to reinitialize, but this would require all Keras objects passed to the child process to be picklable, and they apparently aren't...
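For completeness, a minimal sketch of the 'spawn' variant, under the assumption that only plain, picklable hyperparameter values cross the process boundary (the _fit_in_child and fit_isolated names and the toy model are made up for illustration):

```python
import multiprocessing as mp


def _fit_in_child(hp_values, queue):
    # Runs in a freshly spawned interpreter: the first TensorFlow import
    # happens here, so the GPU context lives and dies with the child.
    from tensorflow import keras

    # Rebuild the model from plain hyperparameter values (picklable)
    # instead of passing Keras objects across the process boundary.
    model = keras.Sequential([
        keras.layers.Dense(hp_values["units"], activation="relu"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # ... load data and call model.fit(...) here ...
    queue.put(0.0)  # placeholder for the objective value


def fit_isolated(hp_values):
    # 'spawn' (not 'fork') forces TensorFlow to initialize from scratch;
    # the GPU memory is released when the child process exits.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    p = ctx.Process(target=_fit_in_child, args=(hp_values, queue))
    p.start()
    objective = queue.get()  # blocks until the child reports its result
    p.join()
    return objective
```

Note that with 'spawn' the calling code must sit under an `if __name__ == "__main__":` guard, since the child re-imports the module.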

Additional context

This TensorFlow OOM is a must-fix problem for performing correct hyperparameter tuning, please help. The proposal above is only a poor man's fix for a very pervasive problem; finding a way to use a dedicated process for each build_and_fit call (to force GPU memory release upon process termination) would be a much better fix!

qmarcou, Jul 04 '22 10:07