HpBandSter
Loading live logs always fails
Hello, I am trying to get live logging to work but never manage to successfully load the results.
To save the partial results I simply follow: https://automl.github.io/HpBandSter/build/html/advanced_examples.html#live-logging
This step appears to work, as I get a results.json and a configs.json which are constantly updated.
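For reference, the live-logging part of that example boils down to creating a json_result_logger and handing it to the optimizer via its result_logger argument. A minimal sketch (the directory name here is a placeholder):

import hpbandster.core.result as hpres

# Appends one JSON line per sampled configuration to configs.json and one
# line per finished run to results.json, while the optimization is running.
result_logger = hpres.json_result_logger(directory='bohb_logs', overwrite=True)
# ... later: BOHB(..., result_logger=result_logger)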
When I try to load the results with:
result = core.result.logged_results_to_HBS_result(directory)
I always get an error of the form:
KeyError Traceback (most recent call last)
<ipython-input-64-72e5a2b8fa48> in <module>()
----> 1 result = core.result.logged_results_to_HBS_result(directory)
~/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages/hpbandster/core/result.py in logged_results_to_HBS_result(directory)
179 id = tuple(config_id)
180
--> 181 data[id].time_stamps[budget] = time_stamps
182 data[id].results[budget] = result
183 data[id].exceptions[budget] = exception
KeyError: (0, 0, 2)
The specific KeyError changes; it's not always (0, 0, 2).
I've attached two example log files (I renamed them to .log to post them here): results.log and configs.log.
Indeed, run (0, 0, 2) is in results.json but not in configs.json ... any idea what is happening?
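One quick way to confirm such a mismatch: both files are JSON lines, where each configs.json line is [config_id, config, config_info] and each results.json line is [config_id, budget, time_stamps, result, exception], so the two sets of ids can be diffed directly. A minimal sketch, assuming both files sit in the current directory:

import json

with open('configs.json') as f:
    config_ids = {tuple(json.loads(line)[0]) for line in f if line.strip()}
with open('results.json') as f:
    result_ids = {tuple(json.loads(line)[0]) for line in f if line.strip()}

# ids logged as results but never as configs -- these trigger the KeyError
print(result_ids - config_ids)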
Thanks!
Hello,
It's hard to guess what is happening without seeing any code. The code will fail if you have config_ids in the results.log but no entry in the configs.log; how that can happen, I am not sure. Could you please post the relevant script you run and how you run it (locally vs. on a cluster, one or multiple workers)?
Best, Stefan
Hi Stefan, thanks for the reply.
First of all, I checked the version number and I get Version: 0.7.4, which seems to be pretty old. I guess this could already be part of the problem.
The code part is pretty straightforward:
import os
import moxing as mox
import hpbandster.core.nameserver as hpns
import hpbandster.core.result as hpres
from hpbandster.optimizers import BOHB

host = hpns.nic_name_to_host(args.nic_name)
host_path = os.path.join(args.shared_directory, "host_port")

if args.worker:
    WaitForServerToComeOnline()
    # The nameserver credentials are loaded from the shared directory, so the
    # nameserver host/port need not be passed to the constructor (the original
    # snippet used ns_host/ns_port here before they were defined).
    w = worker(gpu_id=args.gpu_id, run_id=args.run_id, host=host, timeout=120)
    w.load_nameserver_credentials(working_directory=args.shared_directory)
    w.run(background=False)
    exit(0)

# Live logging: incrementally write configs.json and results.json
result_logger = hpres.json_result_logger(directory=args.shared_directory, overwrite=True)

# Start a nameserver:
NS = hpns.NameServer(run_id=args.run_id, host=host, port=0, working_directory=args.shared_directory)
ns_host, ns_port = NS.start()
with open(host_path, "w") as f:
    f.write("{}\n{}\n".format(ns_host, ns_port))
mox.file.copy_parallel(args.shared_directory, args.s3_directory)

# Start a local worker
w = worker(gpu_id=args.gpu_id, run_id=args.run_id, host=host, nameserver=ns_host,
           nameserver_port=ns_port, timeout=120)
w.run(background=True)

# Run the optimizer
bohb = BOHB(configspace=worker.get_configspace(),
            run_id=args.run_id,
            host=host,
            nameserver=ns_host,
            nameserver_port=ns_port,
            result_logger=result_logger,
            min_budget=args.min_budget, max_budget=args.max_budget)
res = bohb.run(n_iterations=args.n_iterations)
Any ideas?
When the optimization for a parameter configuration is still running, the entry already exists in configs.json but not yet in results.json, since it is not finished. That is why you run into this error. As a dirty workaround, you could copy these files to a different folder and remove the unfinished parameter configurations from configs.json. But I guess the code should be fixed to do that for you when loading the files.
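A minimal sketch of that workaround (my own illustration, not part of the library): keep only the config_ids that appear in both files, write the filtered copies to a separate folder, and load from there. The directory names are placeholders:

import json
import os
import hpbandster.core.result as hpres

src, dst = 'bohb_logs', 'bohb_logs_clean'
os.makedirs(dst, exist_ok=True)

def read_lines(name):
    with open(os.path.join(src, name)) as f:
        return [json.loads(line) for line in f if line.strip()]

configs, results = read_lines('configs.json'), read_lines('results.json')

# keep only ids present in both files (the config_id is the first entry of each line)
common = {tuple(c[0]) for c in configs} & {tuple(r[0]) for r in results}

for name, lines in (('configs.json', configs), ('results.json', results)):
    with open(os.path.join(dst, name), 'w') as f:
        for line in lines:
            if tuple(line[0]) in common:
                f.write(json.dumps(line) + '\n')

result = hpres.logged_results_to_HBS_result(dst)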