HyperParameter optimizer fails after one of the experiments is aborted or fails
When one of the experiments fails (for instance, if the running node is disconnected), the optimizer on the server raises an exception:
Updating job performance summary plot/table
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/usr/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "/home/navot/.clearml/venvs-builds.2/3.9/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1766, in _report_daemon
self._report_completed_status(completed_jobs, cur_completed_jobs, task_logger, title)
File "/home/navot/.clearml/venvs-builds.2/3.9/lib/python3.9/site-packages/clearml/automation/optimization.py", line 1839, in _report_completed_status
job_ids_sorted_by_objective = sorted(
TypeError: '<' not supported between instances of 'NoneType' and 'float'
The optimizer doesn't produce new experiments and stops logging the results of previous experiments.
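For reference, the crash itself is easy to reproduce in isolation: Python 3 refuses to order None against a float, so a single aborted job that never reported an objective poisons the sort in _report_completed_status. A minimal sketch (the completed_jobs dict below is a stand-in for illustration, not clearml's actual internal structure):

completed_jobs = {"job_a": 0.91, "job_b": None, "job_c": 0.87}  # hypothetical objective values

# This mirrors the failing sort and raises the same
# TypeError: '<' not supported between instances of 'NoneType' and 'float'
# sorted(completed_jobs, key=lambda j: completed_jobs[j])

# One possible guard: drop jobs with no reported objective before sorting
job_ids_sorted_by_objective = sorted(
    (j for j in completed_jobs if completed_jobs[j] is not None),
    key=lambda j: completed_jobs[j],
)
print(job_ids_sorted_by_objective)  # ['job_c', 'job_a']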
Hi @navotoz,
That obviously shouldn't happen :) Let me investigate!
@erezalg FYI - one of the hyperparameters is of type str, while the others are floats.
Hi @navotoz, I started an HPO process with these parameters:
hyper_parameters=[
    UniformIntegerParameterRange('Args/batch_size', min_value=128, max_value=512, step_size=128),
    DiscreteParameterRange('Args/stringarg', values=['string1', 'string2']),
],
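(For completeness, here is a sketch of the harness these parameters would typically sit in, assuming a pre-existing base task and the Optuna backend; the base task ID and metric names below are placeholders, since the full script isn't part of this thread:)

from clearml.automation import (
    DiscreteParameterRange,
    HyperParameterOptimizer,
    UniformIntegerParameterRange,
)
from clearml.automation.optuna import OptimizerOptuna

optimizer = HyperParameterOptimizer(
    base_task_id='<base-task-id>',  # placeholder: the template experiment to clone
    hyper_parameters=[
        UniformIntegerParameterRange('Args/batch_size', min_value=128, max_value=512, step_size=128),
        DiscreteParameterRange('Args/stringarg', values=['string1', 'string2']),
    ],
    objective_metric_title='validation',  # placeholder metric title
    objective_metric_series='loss',       # placeholder metric series
    objective_metric_sign='min',
    optimizer_class=OptimizerOptuna,
    max_number_of_concurrent_tasks=1,
    total_max_jobs=8,
)
optimizer.start()  # enqueues trial tasks for the agent to run
optimizer.wait()
optimizer.stop()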
When I kill the terminal running the agent that executes the experiment, I get this warning: [W 2022-04-24 20:52:35,508] Trial 0 failed, because the value None could not be cast to float.
That's odd and we should investigate why, but it doesn't crash :)
What clearml version are you using?
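(As a side note, that Optuna warning can be reproduced standalone: an objective that returns None is marked as a failed trial. In the Optuna versions current at the time of this thread, the study logs the warning and keeps going; newer versions may raise instead. A minimal sketch:)

import optuna

def objective(trial):
    trial.suggest_int('batch_size', 128, 512, step=128)
    return None  # simulates an experiment that died before reporting its metric

study = optuna.create_study()
# With the Optuna versions from this thread, this logs "Trial 0 failed ..." and
# continues; newer Optuna versions may raise a ValueError here instead.
study.optimize(objective, n_trials=1)
print(study.trials[0].state)  # expected: TrialState.FAIL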
@erezalg first, thanks for your time.
I'm using clearml==1.3.2rc3 (WebApp: 1.2.0-153 • Server: 1.2.0-153 • API: 2.16) and Python 3.8.10.
Another interesting occurrence is a loop in the experiments. For instance, with two parameters of two options each I expect 4 experiments, but sometimes one of the experiments just gets repeatedly created and run. I once manually stopped the optimizer after about 1.5 times the expected number of experiments.
To summarize - sometimes there is a crash, and sometimes there is a loop.
Hi @navotoz,
I can look into it, but I think the creation of experiments is done by Optuna rather than by ClearML; we just follow its instructions.
If you find a reliable reproduction method, please share!
Hi @erezalg
I'll try.