Cannot establish a connection to the server's port.
Describe the issue:
I ran nnictl create --config scripts/exp.yml -p 8079 and got the following error:
[2023-08-14 12:52:48] Creating experiment, Experiment ID: 7fgdry6w
[2023-08-14 12:52:48] Starting web server...
[2023-08-14 12:52:49] WARNING: Timeout, retry...
[2023-08-14 12:52:50] WARNING: Timeout, retry...
[2023-08-14 12:52:51] ERROR: Create experiment failed.
There is a closed issue, #5126, about a similar problem, but mine differs slightly: my 7fgdry6w directory contains only a /log directory and no /db directory. The detailed traceback is below:
Traceback (most recent call last):
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/home/anaconda3/envs/env1/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fcec414a810>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/requests/adapters.py", line 450, in send
timeout=timeout
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/anaconda3/envs/env1/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8079): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcec414a810>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/anaconda3/envs/env1/bin/nnictl", line 8, in
Environment:
- NNI version:
- Training service (local|remote|pai|aml|etc): remote
- Client OS:
- Server OS (for remote mode only):
- Python version: 3.7
- PyTorch/TensorFlow version: PyTorch
- Is conda/virtualenv/venv used?: conda
- Is running in Docker?: No
Configuration:
- Experiment config (remember to remove secrets!):
experimentName: scaling law
trialCommand: python /home/NSL_MRL/main.py --epochs=1000 --lr=5e-4 --batch_size=256
trialGpuNumber: 1
trialConcurrency: 5
maxExperimentDuration: 100000h
maxTrialNumber: 100000
tuner:
  name: GridSearch
trainingService:
  platform: local
  useActiveGpu: True
  gpuIndices: [0,1,2,4,5]
  maxTrialNumberPerGpu: 1
- Search space:
finetune_ratio:
  _type: choice
  _value: [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9999999]
Log message:
- nnimanager.log: No such file.
- dispatcher.log: No such file.
- nnictl stdout and stderr: Both contain only the following:
Experiment 7fgdry6w start: 2023-08-14 12:52:48.385616
How to reproduce it?:
I used the same command, nnictl create --config scripts/exp.yml, three days ago and it ran without problems. Today, when I tried to run it on another port, I got the error above.
My environment has torch==1.10.2+cu113 and nni==2.10.1.
I have confirmed that port 8079 is available.
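For anyone who wants to double-check the port claim, here is a minimal sketch of one way to test whether a TCP port can be bound on the local machine. The helper name port_is_free is hypothetical and is not part of NNI; it only uses Python's standard socket module. Note that a free port is consistent with the "Connection refused" error above, since that error usually means nothing is listening on the port yet.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper for illustration only; it is not an NNI API.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE (port taken) or EACCES (privileged port).
            return False
    return True


# 8079 is the port passed to nnictl with -p in this report.
print("port 8079 free:", port_is_free(8079))
```

The check is only a snapshot: another process could still grab the port between this test and the moment the web server tries to listen on it.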
Same problem here with nni 3.0
Same problem here with nni 3.0. But when I use nni 2.5, the problem disappears.
I checked nnictl_error.log and found that the failure is caused by this: node:/lib64/libm.so.6: version 'GLIBC_2.27' not found (required by node). It seems I need GLIBC_2.27, but how can I install it when I have no sudo access?
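To see whether a machine is affected by the same GLIBC requirement, a rough sketch like the following may help. It reads the C library version reported for the running Python interpreter via platform.libc_ver() and compares it with the 2.27 mentioned in the log line. The helper names parse_version and glibc_satisfies are made up for this example, and the sketch assumes the interpreter is linked against the same system glibc that the bundled node binary would load.

```python
import platform
import re


def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '2.17' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def glibc_satisfies(required: str, found: str) -> bool:
    """Return True if the found version is at least the required version."""
    return parse_version(found) >= parse_version(required)


# libc_ver() returns (library name, version); both may be empty strings
# on systems that do not use glibc.
lib, version = platform.libc_ver()
print("C library:", lib or "unknown", version or "unknown")
if lib == "glibc" and version:
    # 2.27 is the version named in the nnictl_error.log message above.
    print("meets GLIBC_2.27:", glibc_satisfies("2.27", version))
```

Versions are compared as integer tuples rather than strings so that, for example, 2.9 is correctly treated as older than 2.27.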