
HyperParameter optimizer fails after one of the experiments is aborted or fails

Open navotoz opened this issue 2 years ago • 6 comments

When one of the experiments fails (for instance, if the node running it is disconnected for some reason), the optimizer on the server raises an exception:

Updating job performance summary plot/table
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/navot/.clearml/venvs-builds.2/3.9/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1766, in _report_daemon
    self._report_completed_status(completed_jobs, cur_completed_jobs, task_logger, title)
  File "/home/navot/.clearml/venvs-builds.2/3.9/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1839, in _report_completed_status
    job_ids_sorted_by_objective = sorted(
TypeError: '<' not supported between instances of 'NoneType' and 'float'

The optimizer doesn't produce new experiments and stops logging the results of previous experiments.
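For illustration, the `TypeError` in the traceback can be reproduced outside ClearML with a few lines of plain Python. This is only a minimal sketch: `completed_jobs` and `sort_jobs_by_objective` are hypothetical names, not ClearML's internal structures, and the None-safe sort is one possible workaround rather than the library's actual fix.

```python
# Hypothetical stand-in for the optimizer's bookkeeping: job id -> objective value.
# A job that was aborted or failed before reporting a metric has no value (None).
completed_jobs = {"job_a": 0.91, "job_b": None, "job_c": 0.87}


def sort_jobs_by_objective(jobs, descending=True):
    """Return job ids sorted by objective, with jobs lacking a value placed last."""
    scored = [job_id for job_id, value in jobs.items() if value is not None]
    unscored = [job_id for job_id, value in jobs.items() if value is None]
    scored.sort(key=lambda job_id: jobs[job_id], reverse=descending)
    return scored + unscored


# Sorting directly on the raw values fails as soon as None meets a float.
error_message = None
try:
    sorted(completed_jobs, key=lambda job_id: completed_jobs[job_id])
except TypeError as exc:
    error_message = str(exc)

# Separating out the jobs without a value avoids the comparison entirely.
ranked = sort_jobs_by_objective(completed_jobs)
print(error_message)
print(ranked)
```

Because the exception here is raised inside a background thread, it would end that thread without stopping the main process, which is consistent with the reporting going silent.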

navotoz avatar Apr 14 '22 05:04 navotoz

Hi @navotoz,

That obviously shouldn't happen :) Let me investigate!

erezalg avatar Apr 18 '22 08:04 erezalg

@erezalg FYI - one of the hyperparameters is of type str, while the others are floats.

navotoz avatar Apr 20 '22 06:04 navotoz

Hi @navotoz, What I did was to start an HPO process with these parameters:

  hyper_parameters=[
    UniformIntegerParameterRange('Args/batch_size', min_value=128, max_value=512, step_size=128),
    DiscreteParameterRange('Args/stringarg', values=['string1','string2']),
  ],

When I terminate the terminal running the agent that runs the experiment, I'm getting this error:

[W 2022-04-24 20:52:35,508] Trial 0 failed, because the value None could not be cast to float.

Which is weird and we should investigate why it's so, but it doesn't crash :)

What clearml version are you using?

erezalg avatar Apr 24 '22 17:04 erezalg

@erezalg first, thanks for your time. I'm using clearml==1.3.2rc3, WebApp: 1.2.0-153 • Server: 1.2.0-153 • API: 2.16 and Python 3.8.10.

Another interesting occurrence is a loop in the experiments. For instance, when using two parameters with two options each, I'm expecting 4 experiments, but sometimes I find that one of the experiments is just repeatedly created and run. I once manually stopped the optimizer after about 1.5 times the expected number of experiments.

To summarize - sometimes there is a crash, but sometimes there is a loop.
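As a rough way to confirm the loop, the launched configurations could be compared against the expected grid. The sketch below uses only the standard library; the search space values and the `launched` list are made up for illustration, and in practice the launched configurations would have to be collected from the experiments the optimizer actually created.

```python
from collections import Counter
from itertools import product

# Hypothetical search space: two parameters with two options each.
search_space = {
    "Args/batch_size": [128, 256],
    "Args/stringarg": ["string1", "string2"],
}


def expected_configurations(space):
    """Enumerate every combination of parameter values (the full grid)."""
    names = sorted(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]


def repeated_configurations(launched):
    """Map each configuration that was launched more than once to its launch count."""
    counts = Counter(tuple(sorted(cfg.items())) for cfg in launched)
    return {cfg: n for cfg, n in counts.items() if n > 1}


expected = expected_configurations(search_space)

# Made-up history: the full grid plus two extra copies of the first configuration,
# i.e. 1.5 times the expected number of experiments.
launched = expected + [expected[0], expected[0]]

repeated = repeated_configurations(launched)
print(len(expected), len(launched))
print(repeated)
```

A non-empty `repeated` result, together with the parameter values it lists, would be a concrete starting point for a reproduction.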

navotoz avatar Apr 25 '22 05:04 navotoz

Hi @navotoz,

I can look at it, but I think the creation of experiments is done by Optuna and not by ClearML; we just follow its instructions.

If you find a reliable reproduction method, please share!

erezalg avatar Apr 25 '22 07:04 erezalg

Hi @erezalg

I'll try.

navotoz avatar Apr 25 '22 15:04 navotoz