
The system crashes when running the “Hello NAS” example code with GPU

Open misagamisaga opened this issue 1 year ago • 7 comments


  • The crash occurs in three environments: Colab, Windows 11 with conda, and Windows 11 without conda.
  • In all three environments, I tried downgrading PyTorch to version 1.13.0, but it still crashes.
  • It crashes both when run as a .py script and as an .ipynb notebook.

My steps

  • Colab: install nni and lightning with pip, then run the “Hello NAS” example code.
  • Windows 11: install PyTorch (following the official installation instructions, which also install torchvision), nni, lightning, ipykernel, and jupyterlab.

I cleared my environment beforehand, so there are no extra package conflicts.

The details of one of the errors

(Env: Windows 11, using conda, PyTorch 2.2.0) Before the crash, I saw a lot of python.exe processes in the Task Manager. I recorded the error at that time:

[2024-02-20 22:24:17] Creating experiment, Experiment ID: 5p9fhwgt
[2024-02-20 22:24:17] Starting web server...
[2024-02-20 22:24:20] Setting up...
[2024-02-20 22:24:20] Web portal URLs: http://26.26.26.1:8084 http://169.254.77.17:8084 http://169.254.202.152:8084 http://169.254.67.238:8084 http://192.168.101.15:8084 http://127.0.0.1:8084
[2024-02-20 22:24:21] Successfully update searchSpace.
[2024-02-20 22:24:21] Checkpoint saved to C:\Users\DELL\nni-experiments\5p9fhwgt\checkpoint.
[2024-02-20 22:24:21] Experiment initialized successfully. Starting exploration strategy...
[2024-02-20 22:24:59] ERROR: Strategy failed to execute.
Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "E:\conda\envs\pytorch_nni\lib\socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\util.py", line 39, in reraise
    raise value
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 539, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "f:\today\nni\try_nni.py", line 144, in <module>
    exp3.run(port=8084)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 236, in run
    return self._run_impl(port, wait_completion, debug)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 205, in _run_impl
    self.start(port, debug)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\experiment\experiment.py", line 270, in start
    self._start_engine_and_strategy()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\experiment\experiment.py", line 230, in _start_engine_and_strategy
    self.strategy.run()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\base.py", line 170, in run
    self._run()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\bruteforce.py", line 220, in _run
    if not self.wait_for_resource():
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\base.py", line 100, in wait_for_resource
    if not self.engine.budget_available():
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\execution\training_service.py", line 271, in budget_available
    return self.nodejs_binding.get_status() in ['INITIALIZED', 'RUNNING', 'TUNER_NO_MORE_TRIAL']
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 413, in get_status
    resp = rest.get(self.port, '/check-status', self.url_prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 43, in get
    return request('get', port, api, prefix=prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
[2024-02-20 22:24:59] Stopping experiment, please wait...
[2024-02-20 22:25:00] Checkpoint saved to C:\Users\DELL\nni-experiments\5p9fhwgt\checkpoint.
[2024-02-20 22:25:20] ERROR: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "E:\conda\envs\pytorch_nni\lib\socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\util.py", line 39, in reraise
    raise value
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 539, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 171, in _stop_nni_manager
    rest.delete(self.port, '/experiment', self.url_prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 52, in delete
    request('delete', port, api, prefix=prefix)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
[2024-02-20 22:25:20] WARNING: Cannot gracefully stop experiment, killing NNI process...
[2024-02-20 22:25:21] ERROR: Failed to receive command. Retry in 0s
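
The tracebacks above end in read timeouts against localhost:8084, which suggests the connection was opened but the web server did not answer within 20 seconds. A minimal sketch of a plain TCP probe that could help tell "server process gone" from "server alive but not responding" is shown here. The helper name `port_accepts_connections` is hypothetical and not part of NNI; it only uses the Python standard library.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    This only checks that something is listening; it does not send an
    HTTP request, so it cannot tell whether the server would answer one.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Example usage while the experiment appears to hang
# (8084 is the port passed to exp3.run in the traceback):
#   print(port_accepts_connections("localhost", 8084))
```

If the probe returns True while the REST calls keep timing out, the server process is most likely still running but blocked or starved; if it returns False, the process has probably exited.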

misagamisaga · Feb 21 '24 14:02

I have the same issue.

534145232 · Mar 22 '24 13:03

Same issue here. But I found that if we set exp.config.trial_gpu_number = 0, the experiment can be launched without using the GPU.
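
A minimal sketch of this workaround, assuming the experiment object exposes `config.trial_gpu_number` as described in this comment. The helper `apply_cpu_only_workaround` and the stand-in object are hypothetical, used only so the snippet runs without NNI installed.

```python
from types import SimpleNamespace


def apply_cpu_only_workaround(exp):
    """Set trial_gpu_number to 0 on the experiment's config.

    `exp` is any object with a `config` attribute that accepts a
    `trial_gpu_number` field (assumed here, based on this thread).
    """
    exp.config.trial_gpu_number = 0
    return exp


# Stand-in for a real experiment object; not an NNI class.
fake_exp = SimpleNamespace(config=SimpleNamespace(trial_gpu_number=1))
apply_cpu_only_workaround(fake_exp)
print(fake_exp.config.trial_gpu_number)  # prints 0
```

With a real experiment, the assignment would go before the `run(port=...)` call seen in the traceback. The trade-off is that trials then run on the CPU only.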

Imfire-waw · Apr 15 '24 08:04

Same issue here. Setting exp.config.trial_gpu_number = 0 does let the experiment launch without using the GPU, but it is too slow.

ranranrannervous · Apr 18 '24 11:04

It may be caused by dwm.exe or the NVIDIA driver. Updating the GPU driver or switching to the Studio version didn't work.

Windows 11 22631.3447, Intel i9-14900HX, RTX 4090, NVIDIA Studio driver 552.22.

zhxn30663 · Apr 23 '24 03:04

Any news on this?

raseidi · Jul 02 '24 21:07

Have there been any updates? There is still an issue :(

rareHalex · Jul 26 '24 05:07

I want to know what is wrong with this project.

Hershel1215 · Jul 30 '24 10:07