nni
Failed to receive command error in runtime JSONDecodeError
Describe the issue:
A custom NAS job with PyTorch models fails with a command error from the NNI runtime (see the message below).
The job only completes if exp.config.max_trial_number is equal to exp.config.trial_concurrency.
Environment:
- NNI version: 3.0
- Training service (local|remote|pai|aml|etc): local
- Client OS: MacOS
- Python version: 3.10.10
- PyTorch version: 2.0
- Is conda/virtualenv/venv used?: pyenv
- Is running in Docker?: no
Error Message:
[2023-09-21 11:22:18] Waiting for models submitted to engine to finish...
[2023-09-21 11:22:35] ERROR: Failed to receive command. Retry in 0s
Traceback (most recent call last):
File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/nni/runtime/command_channel/websocket/channel.py", line 99, in _receive_command
command = conn.receive()
File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/nni/runtime/command_channel/websocket/connection.py", line 116, in receive
return nni.load(msg)
File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/nni/common/serializer.py", line 476, in load
return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/json_tricks/nonp.py", line 259, in loads
return _strip_loads(string, hook, True, **jsonkwargs)
File "/Users/user/.pyenv/versions/3.10.10/envs/platform/lib/python3.10/site-packages/json_tricks/nonp.py", line 266, in _strip_loads
return json_loads(string, object_pairs_hook=object_pairs_hook, **jsonkwargs)
File "/Users/user/.pyenv/versions/3.10.10/lib/python3.10/json/__init__.py", line 359, in loads
return cls(**kw).decode(s)
File "/Users/user/.pyenv/versions/3.10.10/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/user/.pyenv/versions/3.10.10/lib/python3.10/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 24 (char 23)
[2023-09-21 11:22:36] Experiment is completed.
[2023-09-21 11:22:36] Search process is done. You can put an `time.sleep(FOREVER)` here to block the process if you want to continue viewing the experiment.
[2023-09-21 11:22:36] Stopping experiment, please wait...
[2023-09-21 11:22:36] Checkpoint saved to /Users/user/nni-experiments/w6mz14pl/checkpoint.
[2023-09-21 11:22:36] Experiment stopped
Bumping this, as I have the same issue.
Hello, I am also encountering the same issue, with the exact same error message. From looking at the logs, it looks like this happens exactly when the first trial is over.
Me too, any way to fix it?
Had the same issue.
Has anyone solved the problem?
I have the same problem as well.
I have the same problem.
I think I've found a workaround for this issue. In nni/nni/runtime/command_channel/websocket/connection.py, find the receive method of the WsConnection class and pass ignore_comments=False to the nni.load call inside it.
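To find the copy of that file that your environment actually imports, something like the sketch below may help. It only uses importlib from the standard library. The helper name locate_module_file is made up for illustration, and the dotted module name is inferred from the path in the traceback, so treat it as an assumption.

```python
import importlib.util


def locate_module_file(module_name):
    """Return the file path of an importable module, or None if it cannot be found.

    Hypothetical helper for illustration; it is not part of NNI.
    Note that find_spec imports the parent packages of a dotted name.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        return None
    return spec.origin if spec is not None else None


# Assumed module name, derived from the traceback path in this issue.
# Prints None if NNI is not installed in the active environment.
print(locate_module_file("nni.runtime.command_channel.websocket.connection"))
```

Keep in mind that an edit made under site-packages is lost whenever the package is reinstalled or upgraded.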
Thank you very much!
Does it look like this?
```python
    def receive(self) -> Command | None:
        """
        Return received message;
        or return None if the connection has been closed by peer.
        """
        try:
            msg = _wait(self._ws.recv())
            _logger.debug(f'Received {msg}')
        except websockets.ConnectionClosed:  # type: ignore
            _logger.debug('Connection closed by server.')
            self._ws = None
            _decrease_refcnt()
            raise

        if msg is None:
            return None
        # seems the library will inference whether it's text or binary, so we don't have guarantee
        if isinstance(msg, bytes):
            msg = msg.decode()
        return nni.load(msg, ignore_comments=False)
```
Yes, exactly. In my case, some strings that are probably not comments get treated as comments during JSON decoding, which leads to the failure. I set ignore_comments to False and then it works.
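To illustrate the kind of failure described here, below is a minimal sketch that uses only the standard json module. The function naive_strip_comments is a deliberately simplistic, hypothetical stripper and is not the logic NNI or json_tricks actually use. The message payload is also invented. The sketch only shows how removing a "comment" that is really part of a string value can produce the same "Unterminated string" error as in the traceback.

```python
import json


def naive_strip_comments(text):
    """Hypothetical stripper: cut each line at the first '#' or '//'.

    It ignores whether the marker sits inside a JSON string, which is
    exactly the mistake this sketch is meant to demonstrate.
    """
    cleaned = []
    for line in text.splitlines():
        for marker in ("#", "//"):
            line = line.split(marker, 1)[0]
        cleaned.append(line.rstrip())
    return "\n".join(cleaned)


def try_load(text, ignore_comments):
    """Parse JSON, optionally stripping 'comments' first; report decode errors as text."""
    if ignore_comments:
        text = naive_strip_comments(text)
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        return f"JSONDecodeError: {err.msg}"


# Invented payload: the '#' is data inside a string value, not a comment.
message = json.dumps({"type": "metric", "content": "trial #1 finished"})

print(try_load(message, ignore_comments=True))   # string is cut at '#', so decoding fails
print(try_load(message, ignore_comments=False))  # payload is parsed unchanged
```

With stripping enabled, the string value is cut off at the '#' and the decoder sees an unterminated string. With stripping disabled, the same payload parses normally, which matches the behaviour reported with ignore_comments=False.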