nni
The system crashes when running the “Hello NAS” example code with GPU
- The crash occurs in three environments: Colab, Windows 11 with conda, and Windows 11 without conda.
- In all three environments, I tried downgrading PyTorch to version 13.0, but it still crashes.
- It crashes both when run as a .py script and as an .ipynb notebook.
My steps
- Colab: use `pip` to install `nni` and `lightning`, then run the “Hello NAS” example code.
- Windows 11: install `pytorch` (using the official installation instructions, which also install `torchvision`), `nni`, `lightning`, `ipykernel`, and `jupyterlab`.
I cleared my environment beforehand, so there are no extra package conflicts.
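For anyone trying to reproduce this, it may help to attach the exact package versions from each environment. The snippet below is a minimal sketch, not part of the original report: `collect_versions` is a hypothetical helper name, and it assumes the packages are registered under the distribution names `torch`, `torchvision`, `nni`, and `lightning`. It reads installed metadata only, so it does not import PyTorch or touch the GPU.

```python
from importlib import metadata


def collect_versions(packages):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match your install.
    report = collect_versions(["torch", "torchvision", "nni", "lightning"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once in each of the three environments and pasting the output makes it easier to compare setups side by side.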
The details of one of the errors
(Env: Windows 11, using conda, PyTorch 2.2.0) Before the crash, I saw a lot of python.exe processes in the Task Manager. I recorded the error at that time:
[2024-02-20 22:24:17] Creating experiment, Experiment ID: 5p9fhwgt
[2024-02-20 22:24:17] Starting web server...
[2024-02-20 22:24:20] Setting up...
[2024-02-20 22:24:20] Web portal URLs: http://26.26.26.1:8084 http://169.254.77.17:8084 http://169.254.202.152:8084 http://169.254.67.238:8084 http://192.168.101.15:8084 http://127.0.0.1:8084
[2024-02-20 22:24:21] Successfully update searchSpace.
[2024-02-20 22:24:21] Checkpoint saved to C:\Users\DELL\nni-experiments\5p9fhwgt\checkpoint.
[2024-02-20 22:24:21] Experiment initialized successfully. Starting exploration strategy...
[2024-02-20 22:24:59] ERROR: Strategy failed to execute.
Traceback (most recent call last):
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
response = conn.getresponse()
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connection.py", line 466, in getresponse
httplib_response = super().getresponse()
File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 1375, in getresponse
response.begin()
File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 279, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "E:\conda\envs\pytorch_nni\lib\socket.py", line 705, in readinto
return self._sock.recv_into(b)
TimeoutError: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 486, in send
resp = conn.urlopen(
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\util.py", line 39, in reraise
raise value
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
response = self._make_request(
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 539, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 370, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "f:\today\nni\try_nni.py", line 144, in <module>
exp3.run(port=8084)
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 236, in run
return self._run_impl(port, wait_completion, debug)
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 205, in _run_impl
self.start(port, debug)
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\experiment\experiment.py", line 270, in start
self._start_engine_and_strategy()
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\experiment\experiment.py", line 230, in _start_engine_and_strategy
self.strategy.run()
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\base.py", line 170, in run
self._run()
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\bruteforce.py", line 220, in _run
if not self.wait_for_resource():
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\strategy\base.py", line 100, in wait_for_resource
if not self.engine.budget_available():
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\nas\execution\training_service.py", line 271, in budget_available
return self.nodejs_binding.get_status() in ['INITIALIZED', 'RUNNING', 'TUNER_NO_MORE_TRIAL']
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 413, in get_status
resp = rest.get(self.port, '/check-status', self.url_prefix)
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 43, in get
return request('get', port, api, prefix=prefix)
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 532, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
[2024-02-20 22:24:59] Stopping experiment, please wait...
[2024-02-20 22:25:00] Checkpoint saved to C:\Users\DELL\nni-experiments\5p9fhwgt\checkpoint.
[2024-02-20 22:25:20] ERROR: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
Traceback (most recent call last):
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
response = conn.getresponse()
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connection.py", line 466, in getresponse
httplib_response = super().getresponse()
File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 1375, in getresponse
response.begin()
File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
File "E:\conda\envs\pytorch_nni\lib\http\client.py", line 279, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "E:\conda\envs\pytorch_nni\lib\socket.py", line 705, in readinto
return self._sock.recv_into(b)
TimeoutError: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 486, in send
resp = conn.urlopen(
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\util\util.py", line 39, in reraise
raise value
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
response = self._make_request(
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 539, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "E:\conda\envs\pytorch_nni\lib\site-packages\urllib3\connectionpool.py", line 370, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\experiment.py", line 171, in _stop_nni_manager
rest.delete(self.port, '/experiment', self.url_prefix)
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 52, in delete
request('delete', port, api, prefix=prefix)
File "E:\conda\envs\pytorch_nni\lib\site-packages\nni\experiment\rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "E:\conda\envs\pytorch_nni\lib\site-packages\requests\adapters.py", line 532, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8084): Read timed out. (read timeout=20)
[2024-02-20 22:25:20] WARNING: Cannot gracefully stop experiment, killing NNI process...
[2024-02-20 22:25:21] ERROR: Failed to receive command. Retry in 0s
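The traceback shows the Python client giving up after 20 seconds while waiting on `/check-status` at `localhost:8084`. One way to tell whether the web server is merely slow or has stopped responding is to query it by hand with a longer timeout while the experiment is running. This is a rough sketch using only the standard library: `probe_status` is a hypothetical helper, and the `/api/v1/nni` URL prefix is an assumption about how NNI builds its REST URLs, so adjust `path` if your installation differs.

```python
import http.client
import json
import urllib.request


def probe_status(port, timeout=60.0, path="/api/v1/nni/check-status"):
    """Send one GET to the local REST server and return (ok, detail)."""
    url = f"http://localhost:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as exc:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False, f"{type(exc).__name__}: {exc}"
    try:
        return True, json.loads(body)
    except json.JSONDecodeError:
        # The server answered, but not with JSON; return the raw text.
        return True, body


if __name__ == "__main__":
    # 8084 is the port used in the log above.
    print(probe_status(8084))
```

If the probe succeeds with a 60 second timeout but the experiment still fails at 20 seconds, the server is probably just starved for resources. If it never answers, the server process itself has likely hung or died.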
I have the same issue.
Same issue here. But I found that if we set `exp.config.trial_gpu_number = 0`, the experiment can be launched without using the GPU. It is too slow that way, though.
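As a sketch of that workaround: the helper below sets `trial_gpu_number` to 0 on an experiment's config and returns the previous value so it can be restored later. `force_cpu_trials` is a hypothetical name, and the example uses a `SimpleNamespace` stand-in instead of a real NNI experiment so it runs without NNI installed. It assumes only what the comment above describes, namely that the experiment object exposes `config.trial_gpu_number`.

```python
from types import SimpleNamespace


def force_cpu_trials(exp):
    """Set trial_gpu_number to 0 on exp.config and return the old value."""
    previous = getattr(exp.config, "trial_gpu_number", None)
    exp.config.trial_gpu_number = 0
    return previous


# Stand-in object for illustration only; in a real script, pass the
# experiment you created for the "Hello NAS" example instead.
fake_exp = SimpleNamespace(config=SimpleNamespace(trial_gpu_number=1))
previous = force_cpu_trials(fake_exp)
print(previous, fake_exp.config.trial_gpu_number)  # prints "1 0"
```

In the script from the traceback, this would be called on `exp3` before `exp3.run(port=8084)`. It only avoids the crash by keeping trials off the GPU; it does not address the underlying cause.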
It may be caused by dwm.exe or the NVIDIA driver. Updating the GPU driver or switching to the Studio driver didn't help.
Windows 11 22631.3447, Intel i9-14900HX, RTX 4090, NVIDIA Studio driver 552.22.
Any news on this?
Have there been any updates? There is still an issue :(
I want to know what's wrong with this project.