Failed to establish a new connection

Open Roy-Kid opened this issue 3 years ago • 34 comments

I am trying to use NNI on the HPC cluster at our school. The code works on my computer. The HPC has many compute nodes, and tasks have to be submitted from the manager node. But this error is raised:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2b352d9c6f28>: Failed to establish a new connection: [Errno 111] Connection refused',))

I think it might be related to the URL. Maybe I should use nniManagerIP to fix this problem? What host should I specify?

Roy-Kid avatar Mar 29 '21 11:03 Roy-Kid

Hi @Roy-Kid, are you using remote mode to submit jobs? Could you share the full content of nniManager.log?

SparkSnail avatar Apr 01 '21 15:04 SparkSnail

Hi, the experiment fails at the very beginning, so the log folder cannot be created. Here are the errors that were raised:

[2021-04-01 23:06:50] Timeout, retry...
[2021-04-01 23:06:51] Create experiment failed
Traceback (most recent call last):
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f3069207630>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3069207630>: Failed to establish a new connection: [Errno 111] Connection refused',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "launch.py", line 32, in <module>
    experiment.run(17513)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/experiment.py", line 156, in run
    self.start(port, debug)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/experiment.py", line 112, in start
    self._proc = launcher.start_experiment(self.id, self.config, port, debug)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/launcher.py", line 51, in start_experiment
    raise e
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/launcher.py", line 38, in start_experiment
    _check_rest_server(port)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/launcher.py", line 145, in _check_rest_server
    rest.get(port, '/check-status')
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/rest.py", line 26, in get
    return request('get', port, api)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/nni/experiment/rest.py", line 16, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/public/home/Klsr_yqli_04/liuly/anaconda3/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3069207630>: Failed to establish a new connection: [Errno 111] Connection refused',))
[2021-04-01 23:06:52] Stopping experiment, please wait...
[2021-04-01 23:06:52] Experiment stopped

PS: After local mode failed, I tried remote mode to run the experiment on our school's HPC. This time the experiment was created successfully, but the trials stay in the Running state in the WebUI. I checked the job queue and found no unfinished jobs. Our tasks have to be submitted to the HPC through a bash script, so I set the trial command to "bsub < work.lsf", but no task was submitted. So, by the way, how should NNI be used in this situation?

Roy-Kid avatar Apr 01 '21 15:04 Roy-Kid

Hi @Roy-Kid, from the error information, it seems NNI fails to connect to the local service at localhost:17513. Could you please make sure port 17513 is available in your environment? You can use nnictl create --config {config_path} --port {port_number} to choose a different port when creating a new experiment.
In remote mode, do you mean that NNI submits the job successfully but the trial status is stuck in the Running state? Could you start the experiment with nnictl create --config {config_path} --debug and provide the nniManager.log file here?
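
One way to check whether a port is available before passing it to --port is to try binding to it. This is only a minimal sketch using the Python standard library; the helper name port_is_free is made up for illustration and is not part of NNI.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to (host, port).

    Hypothetical helper for illustration; not an NNI function.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # bind() fails with OSError if another socket already holds the port
            sock.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # 17513 is the port from the traceback in this thread
    print("port 17513 free:", port_is_free(17513))
```

If the port turns out to be taken, pick another value for {port_number}. Note that a free port does not by itself guarantee that the experiment will start.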

SparkSnail avatar Apr 02 '21 05:04 SparkSnail

Hello @Roy-Kid, could you follow the suggestions above and update the status of the issue? Thank you!

kvartet avatar Jun 10 '21 13:06 kvartet

Hi @SparkSnail @kvartet! I have left the institute and no longer use the HPC, so I can hardly test the new version. Sorry about that. Once I have the chance I will try it ASAP.

I think the confusing part is that we submit tasks through a queue system like PBS, so it is unclear how to write the script so that the trials run on the compute nodes rather than on the management node. If you have any ideas, please update the tutorial :-) It would be very helpful for those who are not familiar with Linux!

Thanks again for your selfless help!

Roy-Kid avatar Jun 10 '21 13:06 Roy-Kid

We have the same problem. requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))

wuyong-hdu avatar Jun 09 '22 06:06 wuyong-hdu

The details:

(pytorch) wy@Tiger:~/mnist-pytorch$ nnictl create --config config_windows.yml
[2022-06-09 13:32:46] Creating experiment, Experiment ID: k5doghe7
[2022-06-09 13:32:46] Starting web server...
[2022-06-09 13:32:47] WARNING: Timeout, retry...
[2022-06-09 13:32:48] WARNING: Timeout, retry...
[2022-06-09 13:32:49] ERROR: Create experiment failed
Traceback (most recent call last):
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/adapters.py", line 450, in send
    timeout=timeout
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 786, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wy/.local/bin/nnictl", line 8, in <module>
    sys.exit(parse_args())
  File "/home/wy/.local/lib/python3.7/site-packages/nni/tools/nnictl/nnictl.py", line 497, in parse_args
    args.func(args)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/tools/nnictl/launcher.py", line 92, in create_experiment
    exp.start(port, debug, run_mode)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/experiment.py", line 117, in start
    self._proc = launcher.start_experiment(self._action, self.id, config, port, debug, run_mode, self.url_prefix)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/launcher.py", line 119, in start_experiment
    raise e
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/launcher.py", line 97, in start_experiment
    _check_rest_server(port, url_prefix=url_prefix)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/launcher.py", line 258, in _check_rest_server
    rest.get(port, '/check-status', url_prefix)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/rest.py", line 43, in get
    return request('get', port, api, prefix=prefix)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))

wuyong-hdu avatar Jun 09 '22 06:06 wuyong-hdu

We have the same issue:

Reference: https://nni.readthedocs.io/en/stable/reference/experiment_config.html
[2022-06-30 21:45:45] Creating experiment, Experiment ID: in59ltr2
[2022-06-30 21:45:45] Starting web server...
[2022-06-30 21:45:46] WARNING: Timeout, retry...
[2022-06-30 21:45:47] WARNING: Timeout, retry...
[2022-06-30 21:45:48] ERROR: Create experiment failed
Traceback (most recent call last):
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f0353f50e10>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
    timeout=timeout,
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/connectionpool.py", line 786, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=7008): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0353f50e10>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/flowtorch_config/bin/nnictl", line 8, in <module>
    sys.exit(parse_args())
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/tools/nnictl/nnictl.py", line 497, in parse_args
    args.func(args)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/tools/nnictl/launcher.py", line 91, in create_experiment
    exp.start(port, debug, RunMode.Detach)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/experiment.py", line 135, in start
    self._start_impl(port, debug, run_mode, None, [])
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/experiment.py", line 104, in _start_impl
    self.url_prefix, tuner_command_channel, tags)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/launcher.py", line 147, in start_experiment
    raise e
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/launcher.py", line 125, in start_experiment
    _check_rest_server(port, url_prefix=url_prefix)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/launcher.py", line 195, in _check_rest_server
    rest.get(port, '/check-status', url_prefix)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/rest.py", line 43, in get
    return request('get', port, api, prefix=prefix)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/nni/experiment/rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/opt/anaconda3/envs/flowtorch_config/lib/python3.7/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=7008): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0353f50e10>: Failed to establish a new connection: [Errno 111] Connection refused'))

xztcwang avatar Jul 01 '22 02:07 xztcwang

Add @liuzhe-lz for help.

SparkSnail avatar Jul 04 '22 11:07 SparkSnail

Hi, everyone! I hit the same problem when running my code with NNI v2.8. However, the same code works with NNI v2.5. Installing NNI v2.5 may be a workaround, and I hope someone can find out what is wrong in the newest version.

wmbai avatar Jul 09 '22 15:07 wmbai

Rolling back to version 2.5 worked for me; the error no longer appears.

chengpr avatar Aug 08 '22 09:08 chengpr

Thanks @wmbai.

@liuzhe-lz - cc scrum master @ultmaster - this might be a regression in v2.8.

scarlett2018 avatar Aug 15 '22 08:08 scarlett2018

I got the same error in v2.9.

xiangtaowong avatar Sep 13 '22 12:09 xiangtaowong

Hi @xiangtaowong, it looks like the same error was solved in issue https://github.com/microsoft/nni/issues/5126, is that right?

Lijiaoa avatar Sep 19 '22 03:09 Lijiaoa

Yes, I got the same error. I followed the suggestion there to move all data and output paths under /home, off the remote disk, and it sometimes works. But sometimes it still fails; another possible cause is a change to the experimentWorkingDirectory item in config.yml. You could also look at @szhang963's HighEffiNNI for some possible results.

xiangtaowong avatar Sep 19 '22 05:09 xiangtaowong

Is there a solution to this? I'm not using a config.yml file; I set the configuration in the Python script (as in the Hello NAS example). About a week ago I was able to start the web server on my institute's cluster, but now I keep getting the same error.

JuliaWasala avatar Jan 26 '23 15:01 JuliaWasala

As of v2.10, this error generally means "NNI manager fails to start" (since NNI manager is running in another process, we have trouble displaying the real reason why it fails to start. We can only tell that we can't connect to that process.)

To see the real exception, please go to ~/nni-experiments/<experiment_id> and check the logs inside.
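
To look through those logs quickly, something like the following may help. It is a minimal sketch that assumes only that the experiment directory contains *.log files somewhere beneath it (experiment.log and nnimanager.log are the names mentioned in this thread); the helper name tail_logs is made up for illustration and is not an NNI function.

```python
from pathlib import Path


def tail_logs(experiment_dir, max_lines=20):
    """Map each *.log file under experiment_dir to its last max_lines lines.

    Hypothetical helper for illustration; the directory layout is an assumption.
    """
    root = Path(experiment_dir).expanduser()
    tails = {}
    for log_path in sorted(root.rglob("*.log")):
        # errors="replace" avoids crashing on partially written or binary data
        lines = log_path.read_text(errors="replace").splitlines()
        tails[str(log_path.relative_to(root))] = lines[-max_lines:]
    return tails


if __name__ == "__main__":
    # Replace <experiment_id> with the ID printed by "Creating experiment"
    for name, lines in tail_logs("~/nni-experiments/<experiment_id>").items():
        print("==>", name)
        print("\n".join(lines))
```

If the result contains no nnimanager.log at all, that is itself a hint that the manager process never got far enough to write one.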

ultmaster avatar Jan 27 '23 13:01 ultmaster

I get the same error if I try to view a previous experiment with nnictl view. I have some experiment files from the couple of days when I was able to start the web server. The nnictl logs don't show much; the following was added to experiment.log:

[2023-01-18 15:17:39] INFO (nni.nas.experiment.pytorch) Stopping experiment, please wait...
[2023-01-27 10:28:09] INFO (nni.experiment) Creating experiment, Experiment ID: 8nfh3acj
[2023-01-27 10:28:09] INFO (nni.experiment) Starting web server...
[2023-01-27 10:28:10] WARNING (nni.experiment) Timeout, retry...
[2023-01-27 10:28:11] WARNING (nni.experiment) Timeout, retry...
[2023-01-27 10:28:12] ERROR (nni.experiment) Create experiment failed

If I try to start a fresh experiment, it only creates a log directory with a single experiment.log file, which contains the same output as above and nothing else. Is there another place I can look to find the real source of the error?

JuliaWasala avatar Jan 27 '23 13:01 JuliaWasala

Can you find a nnimanager.log? experiment.log wasn't really helpful because it's also from the Python side.

ultmaster avatar Jan 28 '23 17:01 ultmaster

None of the experiments that failed with the "failed to establish connection" error have a nnimanager.log; the only file in those experiment folders is the experiment.log. If I use nnictl view to view a previous experiment, nothing is added to the pre-existing nnimanager.log

JuliaWasala avatar Jan 31 '23 10:01 JuliaWasala

The same issue "ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 8088)"

wangyanhao0517 avatar Mar 12 '23 08:03 wangyanhao0517

v3.0 will fix this issue; please wait for the new NNI release.

Lijiaoa avatar Mar 13 '23 07:03 Lijiaoa

@Lijiaoa when will v3.0 be released? I got the same issue..

LeiWang1999 avatar Mar 18 '23 15:03 LeiWang1999

https://github.com/microsoft/nni/issues/5418#issuecomment-1475473500

Lijiaoa avatar Mar 20 '23 09:03 Lijiaoa

I have a simple fix for this issue: give it more retries.

https://github.com/microsoft/nni/blob/e101717234a9c2b44ea62cea4492b9f391824c0f/nni/experiment/launcher.py#L125

Change the line into the following:

_check_rest_server(port, retry=30, url_prefix=url_prefix)

Many people may work on a cluster without sufficient CPU resources. 3 seconds might be too strict to start a server.
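
To illustrate why a larger retry count helps on a slow node, here is a generic polling loop. This is not NNI's _check_rest_server implementation, only a sketch of the same idea using the standard library; the names http_probe and wait_for_server are made up, and the URL in the usage example is the one from the tracebacks in this thread.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=1.0):
    """Return True if the URL answers with HTTP 200, False if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers "connection refused", timeouts, and non-2xx responses
        return False


def wait_for_server(url, retries=30, delay=1.0, probe=http_probe, sleep=time.sleep):
    """Poll url up to `retries` times; return the successful attempt number, or None."""
    for attempt in range(1, retries + 1):
        if probe(url):
            return attempt
        if attempt < retries:
            sleep(delay)  # give the server process more time to come up
    return None


if __name__ == "__main__":
    url = "http://localhost:17513/api/v1/nni/check-status"
    print("server ready on attempt:", wait_for_server(url, retries=30))
```

With only a few one-second attempts, a server that needs several seconds to boot is reported as unreachable even though it would have started; raising the retry count trades a longer worst-case wait for fewer false failures.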

why-in-Shanghaitech avatar Mar 28 '23 06:03 why-in-Shanghaitech

thanks for sharing @why-in-Shanghaitech @Lijiaoa

LeiWang1999 avatar Mar 29 '23 09:03 LeiWang1999

Hi, I am facing the issue on the latest version. Any suggestions? ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)

ShrutiSarikaChakraborty avatar May 14 '23 23:05 ShrutiSarikaChakraborty

I have the same issue: the experiment stops unexpectedly because the connection is lost. Is there any solution? I hope to hear about your experience, @ShrutiSarikaChakraborty.

lukelu312 avatar Jul 05 '23 20:07 lukelu312

Hello,

I just switched to the legacy version.

Thanks, Shruti


ShrutiSarikaChakraborty avatar Jul 06 '23 02:07 ShrutiSarikaChakraborty

Which legacy version are you using, v2.10.1 or a lower one? Thanks for your reply, @ShrutiSarikaChakraborty.

lukelu312 avatar Jul 06 '23 03:07 lukelu312