nni
Failed to establish a new connection
I am trying to use NNI on the HPC cluster at our school. The code works on my own computer. The HPC has many compute nodes, and we are supposed to submit tasks from the manager node. But this error is raised:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2b352d9c6f28>: Failed to establish a new connection: [Errno 111] Connection refused',))
I think it might be related to the URL. Maybe I should use nniManagerIP to fix this problem? Which host should I specify?
Hi @Roy-Kid, are you using remote mode to submit the job? Could you share the full content of nniManager.log?
Hi, the experiment fails at the very beginning, so the log folder cannot be created. Here are the errors raised:
[2021-04-01 23:06:50] Timeout, retry...
[2021-04-01 23:06:51] Create experiment failed
Traceback (most recent call last):
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 170, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/util/connection.py", line 96, in create_connection
raise err
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/util/connection.py", line 86, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 200, in connect
conn = self._new_conn()
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 182, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f3069207630>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3069207630>: Failed to establish a new connection: [Errno 111] Connection refused',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "launch.py", line 32, in <module>
experiment.run(17513)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/experiment.py", line 156, in run
self.start(port, debug)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/experiment.py", line 112, in start
self._proc = launcher.start_experiment(self.id, self.config, port, debug)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/launcher.py", line 51, in start_experiment
raise e
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/launcher.py", line 38, in start_experiment
_check_rest_server(port)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/launcher.py", line 145, in _check_rest_server
rest.get(port, '/check-status')
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/rest.py", line 26, in get
return request('get', port, api)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/rest.py", line 16, in request
resp = requests.request(method, url, timeout=timeout)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3069207630>: Failed to establish a new connection: [Errno 111] Connection refused',))
[2021-04-01 23:06:52] Stopping experiment, please wait...
[2021-04-01 23:06:52] Experiment stopped
PS: After local mode failed, I tried remote mode to run the experiment on our school's HPC. This time the experiment is created successfully, but the trials always show as running in the WebUI. When I check the job queue, I find no unfinished jobs. Our tasks have to be submitted to the HPC through a bash script, so I set the trial command to "bsub < work.lsf", but no task is submitted. So, by the way, how should I use NNI in this situation?
Hi @Roy-Kid, from the error information, it seems NNI fails to connect to the local service at localhost:17513. Could you please make sure port 17513 is available in your environment? You can use nnictl create --config {config_path} --port {port_number} to set a different port when creating a new experiment.
In remote mode, do you mean that NNI submits the job successfully, but the trial status is stuck in the Running state? Could you use nnictl create --config {config_path} --debug to start the experiment, and provide the nniManager.log file here?
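As a rough way to check the "is the port available" suggestion before launching, the sketch below probes a local port with the standard library. The helper names `port_is_free` and `find_free_port` are hypothetical and are not part of NNI; the sketch only tells you whether something is already listening, not why the NNI manager failed to start.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds,
        # which means some process is already listening on the port.
        return sock.connect_ex((host, port)) != 0


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    wanted = 17513  # the port from the traceback above
    if port_is_free(wanted):
        print(f"port {wanted} looks free")
    else:
        print(f"port {wanted} is in use; try --port {find_free_port()}")
```

A port reported as free here can still be taken by another process before the experiment starts, so treat the result as a hint rather than a guarantee.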
hello @Roy-Kid, could you follow this and update the status of the issue? Thank you!
Hi @SparkSnail @kvartet! I have left the institute and no longer use the HPC, so I can hardly test the new version. Sorry about that. Once I have the chance I will try it ASAP.
I think the confusing part is that we submit tasks through a queue system like PBS, so I am not sure how to write the script so that the trials run on the compute nodes rather than on the management node. If you have any ideas, please update the tutorial :-) It would be very helpful for those who are not familiar with Linux!
Thanks again for your selfless help!
We have the same problem. requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))
The details:
(pytorch) wy@Tiger:~/mnist-pytorch$ nnictl create --config config_windows.yml
[2022-06-09 13:32:46] Creating experiment, Experiment ID: k5doghe7
[2022-06-09 13:32:46] Starting web server...
[2022-06-09 13:32:47] WARNING: Timeout, retry...
[2022-06-09 13:32:48] WARNING: Timeout, retry...
[2022-06-09 13:32:49] ERROR: Create experiment failed
Traceback (most recent call last):
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/adapters.py", line 450, in send
timeout=timeout
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 786, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wy/.local/bin/nnictl", line 8, in
We have the same issue:
Reference: https://nni.readthedocs.io/en/stable/reference/experiment_config.html
[2022-06-30 21:45:45] Creating experiment, Experiment ID: in59ltr2
[2022-06-30 21:45:45] Starting web server...
[2022-06-30 21:45:46] WARNING: Timeout, retry...
[2022-06-30 21:45:47] WARNING: Timeout, retry...
[2022-06-30 21:45:48] ERROR: Create experiment failed
Traceback (most recent call last):
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f0353f50e10>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
timeout=timeout,
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connectionpool.py", line 786, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=7008): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0353f50e10>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/anaconda3/envs/flowtorch_config/bin/nnictl", line 8, in
sys.exit(parse_args())
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/tools/nnictl/nnictl.py", line 497, in parse_args
args.func(args)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/tools/nnictl/launcher.py", line 91, in create_experiment
exp.start(port, debug, RunMode.Detach)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/experiment.py", line 135, in start
self._start_impl(port, debug, run_mode, None, [])
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/experiment.py", line 104, in _start_impl
self.url_prefix, tuner_command_channel, tags)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/launcher.py", line 147, in start_experiment
raise e
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/launcher.py", line 125, in start_experiment
_check_rest_server(port, url_prefix=url_prefix)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/launcher.py", line 195, in _check_rest_server
rest.get(port, '/check-status', url_prefix)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/rest.py", line 43, in get
return request('get', port, api, prefix=prefix)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/adapters.py", line 565, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=7008): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0353f50e10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Add @liuzhe-lz for help.
Hi, everyone! I ran into the same problem when running my code with NNI v2.8. However, the same code works with NNI v2.5. Installing NNI v2.5 might be a workaround, and I hope someone can find out what is wrong in the newest version.
Rolling back to version 2.5 worked for me; this error no longer appears.
Thanks @wmbai.
@liuzhe-lz - cc scrum master @ultmaster - this might be a regression in v2.8.
I got the same error in v2.9.
Hi @xiangtaowong, it looks like the same error was solved in issue https://github.com/microsoft/nni/issues/5126, yes?
Yes, I got the same error. I followed his suggestion of changing all the data and output paths to /home, without the remote disk, and sometimes it works.
But sometimes it still doesn't work. Another possible reason is a change to the experimentWorkingDirectory item in config.yml; you could also look at @szhang963's HighEffiNNI for some possible results.
Is there a solution to this? I'm not using a config.yml file; I set the configuration in the Python script (as in the Hello NAS example). A week or so ago I was able to start the web server on my institute's cluster, but now I keep getting the same error.
As of v2.10, this error generally means "NNI manager fails to start". Since NNI manager runs in another process, we have trouble displaying the real reason why it fails to start; we can only tell that we can't connect to that process.
To see the real exception, please go to ~/nni-experiments/<experiment_id> and check the logs inside.
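To make that lookup quicker, here is a small sketch that collects every `.log` file under an experiment directory and prints the last lines of each. The helper names `find_experiment_logs` and `tail` are hypothetical, and the only layout assumed is the `~/nni-experiments/<experiment_id>` root mentioned above; if the experiment working directory was changed in the config, pass that path as `root` instead.

```python
from pathlib import Path


def find_experiment_logs(experiment_id: str,
                         root: Path = Path.home() / "nni-experiments"):
    """Return all *.log files below <root>/<experiment_id>, sorted by path."""
    exp_dir = Path(root) / experiment_id
    if not exp_dir.is_dir():
        return []
    # Search recursively so the sketch does not depend on the exact
    # sub-directory in which a given NNI version stores its logs.
    return sorted(exp_dir.rglob("*.log"))


def tail(path: Path, n: int = 20):
    """Return the last n lines of a text file as a list of strings."""
    lines = Path(path).read_text(errors="replace").splitlines()
    return lines[-n:]


if __name__ == "__main__":
    # "8nfh3acj" is the experiment ID that appears in the log excerpt
    # posted in this thread; replace it with your own.
    for log_file in find_experiment_logs("8nfh3acj"):
        print(f"==> {log_file}")
        print("\n".join(tail(log_file)))
```

If only a Python-side log turns up, the manager process most likely exited before it could write its own log, which matches what is reported further down in this thread.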
I get the same error if I try to view a previous experiment with nnictl view. I have some experiment files from the couple of days when I was able to start the web server. The nnictl logs don't show much; the following was added to experiment.log:
[2023-01-18 15:17:39] INFO (nni.nas.experiment.pytorch) Stopping experiment, please wait...
[2023-01-27 10:28:09] INFO (nni.experiment) Creating experiment, Experiment ID: 8nfh3acj
[2023-01-27 10:28:09] INFO (nni.experiment) Starting web server...
[2023-01-27 10:28:10] WARNING (nni.experiment) Timeout, retry...
[2023-01-27 10:28:11] WARNING (nni.experiment) Timeout, retry...
[2023-01-27 10:28:12] ERROR (nni.experiment) Create experiment failed
If I try to start a fresh experiment, it only creates a log directory with a single experiment.log file, which contains the same output as above and nothing else. Is there another place I can look to find the real source of the error?
Can you find an nnimanager.log? experiment.log isn't really helpful because it also comes from the Python side.
None of the experiments that failed with the "failed to establish connection" error have a nnimanager.log; the only file in those experiment folders is the experiment.log. If I use nnictl view to view a previous experiment, nothing is added to the pre-existing nnimanager.log
I have the same issue: "ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 8088)"
v3.0 will fix this issue; please wait for the new release of NNI.
@Lijiaoa, when will v3.0 be released? I got the same issue.
https://github.com/microsoft/nni/issues/5418#issuecomment-1475473500
I have a simple fix for this issue: give it more retries.
https://github.com/microsoft/nni/blob/e101717234a9c2b44ea62cea4492b9f391824c0f/nni/experiment/launcher.py#L125
Change the line to the following:
_check_rest_server(port, retry=30, url_prefix=url_prefix)
Many people work on clusters without sufficient CPU resources. 3 seconds might be too short a time for the server to start.
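For anyone who prefers not to edit the installed package, the same idea (poll the check-status endpoint with a larger retry budget) can be sketched as a standalone check. This is only an illustration under stated assumptions: `wait_for_rest_server` and its parameters are hypothetical names, it is not NNI's `_check_rest_server`, and the URL is simply the one shown in the tracebacks earlier in this thread.

```python
import time
import urllib.request


def wait_for_rest_server(port: int, retries: int = 30, interval: float = 1.0,
                         host: str = "localhost", timeout: float = 1.0) -> bool:
    """Poll the check-status endpoint until it answers or retries run out."""
    url = f"http://{host}:{port}/api/v1/nni/check-status"
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError, HTTPError, connection refused and socket timeouts
            # are all OSError subclasses: the server is not ready yet.
            pass
        if attempt < retries:
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # 8080 is one of the ports that appears in the reports above.
    if wait_for_rest_server(8080, retries=30):
        print("REST server is up")
    else:
        print("REST server did not respond; check the experiment logs")
```

With `retries=30` and a one-second interval this waits roughly half a minute, compared with the few seconds mentioned above; the right budget depends on how loaded the node is.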
thanks for sharing @why-in-Shanghaitech @Lijiaoa
Hi, I am facing the issue on the latest version. Any suggestions? ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)
Got the same issue: the experiment stops unexpectedly because the connection is lost. Is there any solution? Hoping to hear from you, @ShrutiSarikaChakraborty.
Hello,
I just switched to the legacy version.
Thanks, Shruti
Which legacy version are you using, v2.10.1 or a lower one? Thanks for your reply, @ShrutiSarikaChakraborty.