
Can't run more than n trials with trialConcurrency=n > 1

Open · studywolf opened this issue 9 months ago • 8 comments

Describe the issue:

When I set trialConcurrency > 1, NNI fails with

[2023-09-30 12:57:40] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 7
Traceback (most recent call last):
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/home/wolf/miniconda3/envs/mausspaun/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 7
[2023-09-30 12:57:41] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-09-30 12:57:44] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

With trialConcurrency = n > 1, NNI runs n trials and then fails with this error. This happens for every n I've tried (2, 5, 10, 100). With trialConcurrency = 1 there are no problems.
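
I haven't dug into the internals, but the traceback points at TPE's receive_trial_result popping the trial's parameter id out of self._running_params, so the KeyError presumably means a final metric arrives for an id the tuner is no longer tracking (e.g. a duplicate or out-of-order report when several trials finish close together). Purely as a diagnostic, a tolerant subclass along these lines might confirm that; it assumes the private _running_params attribute behaves the way the traceback suggests, and it would still need to be registered as a custom tuner to run inside the dispatcher, so it's a sketch rather than a fix:

    # Diagnostic sketch only: drop final results for ids the tuner no longer
    # tracks instead of letting the dispatcher crash. Relies on the private
    # _running_params attribute seen in the traceback, which may change.
    from nni.algorithms.hpo.tpe_tuner import TpeTuner

    class TolerantTpeTuner(TpeTuner):
        def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
            if parameter_id not in self._running_params:
                # Duplicate or out-of-order report; ignore it rather than raise KeyError.
                return
            super().receive_trial_result(parameter_id, parameters, value, **kwargs)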

Environment:

  • NNI version: 3.0
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: ubuntu
  • Server OS (for remote mode only):
  • Python version: 3.10.8
  • PyTorch/TensorFlow version: N/A
  • Is conda/virtualenv/venv used?: conda
  • Is running in Docker?: no

Configuration:

  • Experiment config (remember to remove secrets!):
{
 "params": {
   "experimentType": "hpo",
   "searchSpaceFile": "/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters/search_space.json",
   "trialCommand": "python nni_sweep.py",
   "trialCodeDirectory": "/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters",
   "trialConcurrency": 5,
   "useAnnotation": false,
   "debug": false,
   "logLevel": "info",
   "experimentWorkingDirectory": "/home/wolf/nni-experiments",
   "tuner": {
     "name": "TPE",
     "classArgs": {
       "optimize_mode": "minimize"
     }
   },
   "trainingService": {
     "platform": "local",
     "trialCommand": "python nni_sweep.py",
     "trialCodeDirectory": "/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters",
     "debug": false,
     "maxTrialNumberPerGpu": 1,
     "reuseMode": false
   }
 },
 "execDuration": "13m 8s",
 "nextSequenceId": 14,
 "revision": 95
}
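
For completeness, the same experiment expressed through the Python Experiment API looks roughly like this (paths and settings copied from the dump above; the attribute names are what I believe the 3.0 config schema uses, so double-check before relying on it):

    # Rough Python-API equivalent of the config dump above (NNI 3.0, local
    # training service); paths refer to my project layout.
    from nni.experiment import Experiment

    experiment = Experiment('local')
    experiment.config.search_space_file = \
        '/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters/search_space.json'
    experiment.config.trial_command = 'python nni_sweep.py'
    experiment.config.trial_code_directory = \
        '/home/wolf/Dropbox/code/mouse-arm/examples/nni_arm_parameters'
    experiment.config.trial_concurrency = 5   # anything > 1 triggers the crash for me
    experiment.config.tuner.name = 'TPE'
    experiment.config.tuner.class_args = {'optimize_mode': 'minimize'}

    experiment.run(8080)   # web UI on http://localhost:8080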

I haven't created a minimal reproducible example yet; I'm hoping someone recognizes this problem, as it seems pretty basic and may just be a version issue somewhere.
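
If it's useful, a stand-in trial script as simple as this should exercise the same code path (this is hypothetical, not my actual nni_sweep.py, since the crash doesn't seem to depend on what the trial computes):

    # Hypothetical stand-in for nni_sweep.py: pulls a parameter set from NNI and
    # reports a dummy final metric, so the tuner side gets exercised without any
    # real training.
    import random
    import time

    import nni

    params = nni.get_next_parameter()   # sampled from search_space.json
    time.sleep(random.uniform(1, 5))    # simulate uneven trial durations
    nni.report_final_result(random.random())   # dummy loss to minimize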

studywolf · Sep 30 '23

I encountered the same problem: sometimes it stopped after about ten trials, and sometimes after more than 100. I haven't found what caused it.

igodrr · Oct 2 '23

I also have similar problems.

cehw · Oct 4 '23

I have the same issue as well and am looking forward to a solution.

Environment:

  • NNI version: 3.0
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: ubuntu 22.04.3
  • Server OS (for remote mode only):
  • Python version: 3.10.13
  • PyTorch/TensorFlow version: PyTorch 2.1.0
  • Is conda/virtualenv/venv used?: virtualenv
  • Is running in Docker?: no

kv-42 · Nov 13 '23

Same issue.

[2023-11-29 21:52:30] ERROR (nni.runtime.msg_dispatcher_base/Thread-1) 1
Traceback (most recent call last):
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/home/bingyaowang/anaconda3/envs/myrsn1/lib/python3.8/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 1
[2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"EN","content":"{\"trial_job_id\":\"..._index\\\": 0}\"}"}' [402 bytes]
[2023-11-29 21:52:31] DEBUG (websockets.client/NNI-WebSocketEventLoop) < TEXT '{"type":"GE","content":"1"}' [27 bytes]
[2023-11-29 21:52:31] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PING '' [0 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PONG '' [0 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % sending keepalive ping
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) > PING c8 af a3 c2 [binary, 4 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) < PONG c8 af a3 c2 [binary, 4 bytes]
[2023-11-29 21:52:33] DEBUG (websockets.client/NNI-WebSocketEventLoop) % received keepalive pong
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > TEXT '{"type": "bye"}' [17 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSING
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) > CLOSE 4000 (private use) client intentionally close [28 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) < CLOSE 4000 (private use) client intentionally close [28 bytes]
[2023-11-29 21:52:34] DEBUG (websockets.client/NNI-WebSocketEventLoop) = connection is CLOSED
[2023-11-29 21:52:34] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

I set trial_concurrency = 8 and it always stopped after 10 to 14 trials.

wby13 · Nov 30 '23

Same issue. I set trial_concurrency = 16 and it stopped at around 20 trials; the dispatcher was terminated.

XYxiyang · Dec 15 '23

Same issue here on the latest version of NNI. How many trials it gets through seems random each time, but it always ends with

    params = self._running_params.pop(parameter_id)
KeyError: ...

in dispatcher.log.

I think the problem went away after downgrading to nni<3.
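
If it helps anyone, after reinstalling I sanity-checked which version the experiment environment actually picked up, roughly like this (2.10 is just what I'd expect the last 2.x release to be):

    # Quick sanity check after downgrading with e.g. `pip install "nni<3"`:
    import nni
    print(nni.__version__)   # expect something like 2.10 rather than 3.0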

arvoelke · Jan 20 '24

I faced the same problem, and in my case a stopgap solution was to use the "Anneal" tuner instead of the "TPE" tuner. Hope it helps!
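
For concreteness, the only change on my side was the tuner section; with the Python Experiment API it would look roughly like this (a sketch, with everything else in the experiment configured as in the original post):

    # Sketch: swap the built-in TPE tuner for the Anneal tuner; the rest of the
    # experiment configuration stays the same.
    from nni.experiment import Experiment

    experiment = Experiment('local')
    experiment.config.tuner.name = 'Anneal'   # instead of 'TPE'
    experiment.config.tuner.class_args = {'optimize_mode': 'minimize'}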

datngo93 · Jan 20 '24

I've found that anything above 2.5 gives me the problem; version 2.5 has been fine up to the hard-coded memory limit (roughly 45k trials).

studywolf · Jan 25 '24