Failed to receive command error in runtime JSONDecodeError

Open msuzen opened this issue 1 year ago • 11 comments

Describe the issue:

A custom NAS job with PyTorch models fails with a command error from the NNI runtime (see the message below). The job only completes if exp.config.max_trial_number is equal to exp.config.trial_concurrency.

Environment:

  • NNI version: 3.0
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: MacOS
  • Python version: 3.10.10
  • PyTorch version: 2.0
  • Is conda/virtualenv/venv used?: pyenv
  • Is running in Docker?: no

Error Message:

[2023-09-21 11:22:18] Waiting for models submitted to engine to finish...
[2023-09-21 11:22:35] ERROR: Failed to receive command. Retry in 0s
Traceback (most recent call last):
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/nni/runtime/command_channel/websocket/channel.py", line 99, in _receive_command
    command = conn.receive()
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/nni/user/command_channel/websocket/connection.py", line 116, in receive
    return nni.load(msg)
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/nni/common/serializer.py", line 476, in load
    return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/json_tricks/nonp.py", line 259, in loads
    return _strip_loads(string, hook, True, **jsonkwargs)
  File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/json_tricks/nonp.py", line 266, in _strip_loads
    return json_loads(string, object_pairs_hook=object_pairs_hook, **jsonkwargs)
  File "/Users/user/.pyenv/versions/3.10.10/lib/python3.10/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/Users/user/.pyenv/versions/3.10.10/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/user/.pyenv/versions/3.10.10/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 24 (char 23)
[2023-09-21 11:22:36] Experiment is completed.
[2023-09-21 11:22:36] Search process is done. You can put an `time.sleep(FOREVER)` here to block the process if you want to continue viewing the experiment.
[2023-09-21 11:22:36] Stopping experiment, please wait...
[2023-09-21 11:22:36] Checkpoint saved to /Users/user/nni-experiments/w6mz14pl/checkpoint.
[2023-09-21 11:22:36] Experiment stopped
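
When a traceback like this reports only a character offset, it can help to look at the raw message around that offset. The following is a minimal debugging sketch using only the standard library; `describe_decode_error` and the sample payload are hypothetical illustrations, not part of NNI and not taken from the log above.

```python
import json


def describe_decode_error(raw: str, width: int = 20) -> str:
    """Return a short excerpt of `raw` around the point where JSON decoding fails.

    Returns an empty string if `raw` is valid JSON.
    """
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        # exc.pos is the character offset reported in the error message.
        start = max(exc.pos - width, 0)
        end = min(exc.pos + width, len(raw))
        return f"{exc.msg} at char {exc.pos}: {raw[start:end]!r}"
    return ""


# Hypothetical truncated payload, used only to trigger the same kind of error.
truncated = '{"type": "example", "x": "abc'
report = describe_decode_error(truncated)
print(report)
```

Calling such a helper on the string passed to the decoder would show whether the message was already cut short before it reached the JSON parser.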

msuzen avatar Sep 21 '23 09:09 msuzen

pushing this, as I have the same issue

mo-tion avatar Sep 29 '23 08:09 mo-tion

Hello, I am also encountering the same issue, with the exact same error message. From looking at the logs, it looks like this happens exactly when the first trial is over.

ElbazHaim avatar Oct 25 '23 15:10 ElbazHaim

Me too, any way to fix it?

AlondraMM avatar Dec 04 '23 06:12 AlondraMM

Had the same issue.

liuzhengx avatar Dec 05 '23 13:12 liuzhengx

Has anyone solved the problem?

jimmy133719 avatar Jan 24 '24 04:01 jimmy133719

I have the same problem.

z520yu avatar Mar 18 '24 13:03 z520yu

I have the same problem.

Mingbo-Lee avatar Apr 14 '24 13:04 Mingbo-Lee

I think I've found a way around this issue. In nni/nni/runtime/command_channel/websocket/connection.py, find the receive method of the WsConnection class and pass ignore_comments=False to the nni.load call inside it.

haoshuai-orka avatar Apr 18 '24 12:04 haoshuai-orka

Thank you very much!

Mingbo-Lee avatar Apr 19 '24 03:04 Mingbo-Lee

Does it look like this?

    def receive(self) -> Command | None:
        """
        Return received message;
        or return `None` if the connection has been closed by peer.
        """
        try:
            msg = _wait(self._ws.recv())
            _logger.debug(f'Received {msg}')
        except websockets.ConnectionClosed:  # type: ignore
            _logger.debug('Connection closed by server.')
            self._ws = None
            _decrease_refcnt()
            raise

        if msg is None:
            return None
        # seems the library will inference whether it's text or binary, so we don't have guarantee
        if isinstance(msg, bytes):
            msg = msg.decode()
        return nni.load(msg, ignore_comments=False)

ranranrannervous avatar Apr 19 '24 05:04 ranranrannervous

Yes, exactly. In my case, some strings that are probably not comments are treated as comments during the JSON decoding phase, which causes the failure. I set ignore_comments to False and then it works.
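
To illustrate that failure mode, here is a minimal standard-library sketch. The `naive_strip_comments` function is a deliberately simplified, hypothetical stand-in and is not the comment-stripping code that json_tricks actually uses; the payload is also made up. It only shows how removing "comments" from a message whose string values contain `#` or `//` can leave an unterminated string.

```python
import json


def naive_strip_comments(text: str) -> str:
    """Hypothetical stand-in: cut each line at the first '#' or '//', ignoring quotes."""
    stripped = []
    for line in text.splitlines():
        for marker in ("#", "//"):
            idx = line.find(marker)
            if idx != -1:
                line = line[:idx]
        stripped.append(line.rstrip())
    return "\n".join(stripped)


# Made-up message whose string value happens to contain '//'.
payload = json.dumps({"type": "report", "url": "http://localhost:8080/api"})

# Decoding the untouched payload works.
assert json.loads(payload)["url"] == "http://localhost:8080/api"

# Decoding after comment stripping fails, because the string was cut at '//'.
try:
    json.loads(naive_strip_comments(payload))
    error = None
except json.JSONDecodeError as exc:
    error = exc

print(error)
```

With comment handling disabled, the message is passed to the JSON decoder unchanged, which is consistent with the workaround reported in this thread.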

haoshuai-orka avatar Apr 24 '24 13:04 haoshuai-orka