
Loading live logs always fails

Open fmcarlucci opened this issue 5 years ago • 3 comments

Hello, I am trying to get live logging to work but never manage to successfully load the results.

To save the partial results I simply follow: https://automl.github.io/HpBandSter/build/html/advanced_examples.html#live-logging

This step appears to work, as I get a results.json and a configs.json which constantly update.

When I try to load the results with: result = core.result.logged_results_to_HBS_result(directory) I always get an error of the form:

KeyError                                  Traceback (most recent call last)
<ipython-input-64-72e5a2b8fa48> in <module>()
----> 1 result = core.result.logged_results_to_HBS_result(directory)

~/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages/hpbandster/core/result.py in logged_results_to_HBS_result(directory)
    179                         id = tuple(config_id)
    180 
--> 181                         data[id].time_stamps[budget] = time_stamps
    182                         data[id].results[budget] = result
    183                         data[id].exceptions[budget] = exception

KeyError: (0, 0, 2)

The specific KeyError changes; it is not always (0, 0, 2).

I've attached two example log files (renamed to .log so they could be posted here): results.log, configs.log

Indeed, run (0, 0, 2) is in results.json but not in configs.json ... any idea what is happening?

Thanks!

fmcarlucci avatar Aug 15 '19 09:08 fmcarlucci

Hello,

It's hard to guess what is happening without seeing any code. The code will fail if you have config_ids in the results.log but no corresponding entry in the configs.log. How that can happen, I am not sure. Could you please post the relevant script you run and how you run it (locally vs. cluster, one or multiple workers)?
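In the meantime, a small diagnostic sketch like the following can list the offending ids. This is a hypothetical helper, not part of HpBandSter; it assumes each line of the two log files is a JSON list whose first element is the config_id, which is how the json_result_logger writes them:

```python
import json

def find_orphan_ids(configs_path, results_path):
    """Return config_ids that appear in results.json but have no
    corresponding entry in configs.json (sorted for readability)."""
    def ids(path):
        with open(path) as f:
            # each line is a JSON list; element 0 is the config_id
            return {tuple(json.loads(line)[0]) for line in f if line.strip()}
    return sorted(ids(results_path) - ids(configs_path))
```

Any id this returns is exactly one that would trigger the KeyError above when loading.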

Best, Stefan


sfalkner avatar Aug 26 '19 18:08 sfalkner

Hi Stefan, thanks for the reply.

First of all, I checked the version number and I get Version: 0.7.4, which seems pretty old. I guess this could already be part of the problem.

The code part is pretty straightforward:

host = hpns.nic_name_to_host(args.nic_name)
host_path = os.path.join(args.shared_directory, "host_port")

if args.worker:
    WaitForServerToComeOnline()
    w = worker(gpu_id=args.gpu_id, run_id=args.run_id, host=host, timeout=120, nameserver=ns_host,
               nameserver_port=ns_port)
    w.load_nameserver_credentials(working_directory=args.shared_directory)
    w.run(background=False)
    exit(0)


result_logger = hpres.json_result_logger(directory=args.shared_directory, overwrite=True)

# Start a nameserver:
NS = hpns.NameServer(run_id=args.run_id, host=host, port=0, working_directory=args.shared_directory)
ns_host, ns_port = NS.start()
with open(host_path, "w") as f:
    f.write("{}\n{}\n".format(ns_host, ns_port))
mox.file.copy_parallel(args.shared_directory, args.s3_directory)

# Start local worker
w = worker(gpu_id=args.gpu_id, run_id=args.run_id, host=host, nameserver=ns_host, nameserver_port=ns_port, timeout=120)
w.run(background=True)

# Run optimizer
bohb = BOHB(configspace=worker.get_configspace(),
            run_id=args.run_id,
            host=host,
            nameserver=ns_host,
            nameserver_port=ns_port,
            result_logger=result_logger,
            min_budget=args.min_budget, max_budget=args.max_budget,
            )
res = bohb.run(n_iterations=args.n_iterations)

Any ideas?

fmcarlucci avatar Aug 27 '19 10:08 fmcarlucci

When the optimization for a parameter configuration is still running, the entry already exists in configs.json but not yet in results.json, since it is not finished. That is why you run into this error. As a dirty workaround, you could copy these files to a different folder and remove the parameter configurations from configs.json that are not finished yet. But I guess the loading code should be fixed to do that for you when loading the files.
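A sketch of such a cleanup, as a hypothetical helper that is not part of HpBandSter: following Stefan's diagnosis above, it copies the two log files into a fresh directory and drops any results.json line whose config_id has no matching entry in configs.json, since those are the ids the loader raises KeyError on. It assumes the line format the json_result_logger writes (each line a JSON list with the config_id as the first element):

```python
import json
import os
import shutil

def make_loadable_copy(src_dir, dst_dir):
    """Copy configs.json and results.json from src_dir into dst_dir,
    dropping results lines whose config_id is unknown to configs.json.
    Returns (kept, dropped) line counts for the results file."""
    os.makedirs(dst_dir, exist_ok=True)
    src_configs = os.path.join(src_dir, 'configs.json')
    shutil.copy(src_configs, os.path.join(dst_dir, 'configs.json'))

    # collect every config_id that configs.json knows about
    with open(src_configs) as f:
        known = {tuple(json.loads(line)[0]) for line in f if line.strip()}

    kept = dropped = 0
    with open(os.path.join(src_dir, 'results.json')) as f_in, \
         open(os.path.join(dst_dir, 'results.json'), 'w') as f_out:
        for line in f_in:
            if not line.strip():
                continue
            if tuple(json.loads(line)[0]) in known:
                f_out.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Pointing logged_results_to_HBS_result at dst_dir should then load without the KeyError, at the cost of silently losing the orphaned results.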

totifra avatar Mar 10 '21 09:03 totifra