Hyperband rng initialization seems broken

Open legaultmarc opened this issue 1 year ago • 3 comments

When running orion hunt, the first iteration typically works fine, but I get the following traceback on the second iteration:

Traceback (most recent call last):
  File "/home/legaultm/mlenv3/bin/orion", line 8, in <module>
    sys.exit(main())
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/cli/__init__.py", line 36, in main
    return orion_parser.execute(argv)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/cli/base.py", line 110, in execute
    returncode = function(args)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/cli/hunt.py", line 209, in main
    workon(experiment, ignore_code_changes=ignore_code_changes, **worker_config)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/cli/hunt.py", line 163, in workon
    client.workon(
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/client/experiment.py", line 810, in workon
    rval = runner.run()
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/client/runner.py", line 306, in run
    gathered = self.gather()
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/client/runner.py", line 409, in gather
    self.client.observe(trial, result.value)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/client/experiment.py", line 619, in observe
    self._producer.observe(trial)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/worker/producer.py", line 38, in observe
    algorithm.observe([trial])
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/worker/experiment.py", line 465, in acquire_algorithm_lock
    locked_algorithm_state.set_state(self.algorithms.state_dict)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/worker/primary_algo.py", line 103, in state_dict
    "algorithm": self.algorithm.state_dict,
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/algo/hyperband.py", line 285, in state_dict
    "rng_state": self.rng.get_state(),
AttributeError: 'Hyperband' object has no attribute 'rng'

I am under the impression that self.seed_rng() is not being called to initialize the self.rng attribute, which causes this error.

Expected behavior: I don't think this AttributeError should happen.

Steps to reproduce: In my case, this happens after calling orion hunt on a single machine.

Environment (please complete the following information):

  • OS: Windows 11 WSL -> Ubuntu 20.04.5 LTS
  • Python version: 3.8
  • Oríon version: 0.2.6
  • Database: PickledDB

Additional context: Here is my Orion config file:

database:
  host: /home/legaultm/.local/share/orion.core/orion/orion_db.pkl
  type: pickleddb

experiment:
  algorithms:
    hyperband:
      seed: 42
      repetitions: 1

evc:
  enable: True

Possible solution: I set the seed in my config, and this seems to fix the problem for the first iteration, but not for subsequent iterations. If I force a call to self.seed_rng() in the Hyperband class __init__, I seem to be able to circumvent the problem. I'm not sure what the right fix is.
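To make the workaround concrete, here is a minimal sketch of the "always seed in __init__" idea. It assumes nothing about Orion's internals: `ToyAlgo` is a hypothetical stand-in class, and the standard library's `random.Random` (with `getstate()`) replaces whatever generator Hyperband actually uses (the traceback shows `get_state()`).

```python
import random


class ToyAlgo:
    """Hypothetical stand-in for an algorithm class; this is not Orion's Hyperband."""

    def __init__(self, seed=None):
        # Always create the generator, even when no seed is configured.
        self.seed_rng(seed)

    def seed_rng(self, seed):
        # random.Random(None) falls back to an OS entropy source or the
        # system time, so self.rng exists whether or not a seed is given.
        self.rng = random.Random(seed)

    @property
    def state_dict(self):
        # Safe to call at any time because self.rng is set in __init__.
        return {"rng_state": self.rng.getstate()}


unseeded_state = ToyAlgo().state_dict
seeded_state = ToyAlgo(seed=42).state_dict
```

With this pattern, saving the state no longer depends on whether a seed was configured. Whether this is the appropriate fix inside Orion itself is a separate question.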

legaultmarc, Sep 28 '22 19:09

Hi @legaultmarc, thanks for the detailed bug report! We will look into this asap.

bouthilx, Sep 28 '22 20:09

I did not manage to reproduce the issue using your config file, but looking at the code I can see that running with no seed would cause it. Removing the seed from your config triggers the error on my side. Did you run without a seed before? We'll fix the no-seed case, but I'd like to be sure that there are no other corner cases we are missing.
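The no-seed failure mode can be illustrated with a toy class. This is only a sketch under assumptions: `LazySeedAlgo` and `can_save_state` are hypothetical names, the conditional seeding in `__init__` is a guess at the kind of pattern that would produce the traceback, and `random.Random` stands in for the real generator.

```python
import random


class LazySeedAlgo:
    """Hypothetical class, not Orion's code; mimics the suspected pattern."""

    def __init__(self, seed=None):
        # Assumption: the generator is only created when a seed is supplied.
        if seed is not None:
            self.seed_rng(seed)

    def seed_rng(self, seed):
        self.rng = random.Random(seed)

    @property
    def state_dict(self):
        # Raises AttributeError if seed_rng() was never called.
        return {"rng_state": self.rng.getstate()}


def can_save_state(algo):
    """Return True if the algorithm state can be read without an AttributeError."""
    try:
        algo.state_dict
        return True
    except AttributeError:
        return False


with_seed = can_save_state(LazySeedAlgo(seed=42))
without_seed = can_save_state(LazySeedAlgo())
```

In this sketch the seeded instance can save its state and the unseeded one cannot, which matches the behavior described above when the seed is removed from the config.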

bouthilx, Sep 28 '22 20:09

It seems that now even I can't reproduce this bug. I was trying out different algorithms when I first encountered it, so maybe it was due to a weird state in the config/database or some other Python caching? I too now only get it when the seed is set to null in the config. I'll report back here if it happens again...

Thanks for your rapid response :)

legaultmarc, Sep 29 '22 01:09