Can't run more than n trials with trialConcurrency=n > 1
Describe the issue:
When I set trialConcurrency > 1, NNI fails with:
[2023-09-30 12:57:40] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 7
Traceback (most recent call last):
File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
self.process_command(command, data)
File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
command_handlers[command](data)
File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
self._handle_final_metric_data(data)
File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
params = self._running_params.pop(parameter_id)
KeyError: 7
[2023-09-30 12:57:41] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-09-30 12:57:44] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
With trialConcurrency = n > 1, NNI runs n trials and then fails with this error. This happens for every n I've tried (2, 5, 10, 100). With trialConcurrency = 1 there are no problems.
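From the traceback, the dispatcher dies because TpeTuner.receive_trial_result pops a parameter_id that is no longer in _running_params, which looks like a duplicate or out-of-order final-metric callback under concurrency. As a stopgap only (not a fix, and assuming the dropped callback really is a duplicate), a subclass that ignores unknown parameter_ids should keep the dispatcher alive; PatchedTpeTuner is a hypothetical name, registered as a customized tuner per the NNI docs:

from nni.algorithms.hpo.tpe_tuner import TpeTuner

class PatchedTpeTuner(TpeTuner):
    # Drop final results whose parameters were never recorded or were already consumed,
    # instead of letting the KeyError crash the whole dispatcher.
    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        if parameter_id not in self._running_params:
            return  # swallow the duplicate/late callback
        super().receive_trial_result(parameter_id, parameters, value, **kwargs)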
Environment:
- NNI version: 3.0
- Training service (local|remote|pai|aml|etc): local
- Client OS: ubuntu
- Server OS (for remote mode only):
- Python version: 3.10.8
- PyTorch/TensorFlow version: N/A
- Is conda/virtualenv/venv used?: conda
- Is running in Docker?: no
Configuration:
- Experiment config (remember to remove secrets!):
{
  "params": {
    "experimentType": "hpo",
    "searchSpaceFile": "/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters/search_space.json",
    "trialCommand": "python nni_sweep.py",
    "trialCodeDirectory": "/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters",
    "trialConcurrency": 5,
    "useAnnotation": false,
    "debug": false,
    "logLevel": "info",
    "experimentWorkingDirectory": "/home/wolf/nni-experiments",
    "tuner": {
      "name": "TPE",
      "classArgs": {
        "optimize_mode": "minimize"
      }
    },
    "trainingService": {
      "platform": "local",
      "trialCommand": "python nni_sweep.py",
      "trialCodeDirectory": "/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters",
      "debug": false,
      "maxTrialNumberPerGpu": 1,
      "reuseMode": false
    }
  },
  "execDuration": "13m 8s",
  "nextSequenceId": 14,
  "revision": 95
}
I haven't created a minimal reproducible example yet; I'm hoping someone might recognize this problem, since it seems pretty basic and may just be a version issue somewhere. A sketch of what such a repro might look like is below.
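For anyone trying to reproduce this, a minimal sketch following the NNI quickstart pattern (the search space, random metric, and port are made-up placeholders, not from my actual experiment) would be a trial script that reports a dummy loss, plus a launcher with trialConcurrency > 1:

# nni_sweep.py -- dummy trial: fetch parameters, report a random loss
import random
import nni

params = nni.get_next_parameter()
nni.report_final_result(random.random())

# launch.py -- start a local HPO experiment with concurrency > 1
from nni.experiment import Experiment

experiment = Experiment('local')
experiment.config.trial_command = 'python nni_sweep.py'
experiment.config.trial_code_directory = '.'
experiment.config.search_space = {'x': {'_type': 'uniform', '_value': [0, 1]}}
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'minimize'
experiment.config.trial_concurrency = 5
experiment.config.max_trial_number = 50
experiment.run(8080)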
I encountered the same problem; sometimes it stopped after about ten trials, and sometimes after more than 100. I haven't found what causes it.
I also have a similar problem.
I have the same issue as well and am looking forward to a solution.
Environment:
- NNI version: 3.0
- Training service (local|remote|pai|aml|etc): local
- Client OS: ubuntu 22.04.3
- Server OS (for remote mode only):
- Python version: 3.10.13
- PyTorch/TensorFlow version: PyTorch 2.1.0
- Is conda/virtualenv/venv used?: virtualenv
- Is running in Docker?: no
Same issue.
[2023-11-29 21:52:30] ERROR (nni.runtime.msg_dispatcher_base/Thread-1) 1
Traceback (most recent call last):
File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
self.process_command(command, data)
File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
command_handlers[command](data)
File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
self._handle_final_metric_data(data)
File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
params = self._running_params.pop(parameter_id)
KeyError: 1
[2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"EN","content":"{\"trial_job_id\":\"..._index\\\": 0}\"}"}' [402 bytes]
[2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"GE","content":"1"}' [27 bytes]
[2023-11-29 21:52:31] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PING '' [0 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PONG '' [0 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % sending keepalive ping
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PING c8 af a3 c2 [binary, 4 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PONG c8 af a3 c2 [binary, 4 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % received keepalive pong
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > TEXT '{"type": "bye"}' [17 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSING
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > CLOSE 4000 (private use) client intentionally close [28 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) < CLOSE 4000 (private use) client intentionally close [28 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSED
[2023-11-29 21:52:34] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
I set trial_concurrency=8 and it always stopped after 10 to 14 trials.
Same issue. I set trial_concurrency=16 and it stopped at around 20 trials; the dispatcher was terminated.
Same issue here on the latest version of NNI. It seems random how many trials it gets through each time. dispatcher.log always ends with:
params = self._running_params.pop(parameter_id)
KeyError: ...
I think the problem went away after downgrading to nni<3.
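For anyone who wants to try that, the pin would presumably be something like:

pip install "nni<3"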
I faced the same problem, and in my case a stopgap solution is to use the "Anneal" tuner instead of the "TPE" tuner. Hope it helps!
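In the JSON config from the original post, that swap would just change the tuner block (assuming Anneal accepts the same optimize_mode class argument, which the built-in Anneal tuner does):

"tuner": {
  "name": "Anneal",
  "classArgs": {
    "optimize_mode": "minimize"
  }
}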
I found that anything above 2.5 gives me the problem; version 2.5 has been okay up to the hard-coded memory limit (roughly 45k trials).