Securing Aim Remote Tracking server using SSL key and certificate

Open JeroenVranken opened this issue 1 year ago • 9 comments

Hi, first of all I appreciate all the work you've put into making Aim!

I am having some trouble securing the connection to the Aim Remote Tracking (RT) Server, and was wondering if you could help me out.

I recently set up a virtual machine on Azure, which is running both the Aim RT Server and the Aim UI. To do this, I used a docker-compose.yml, which brings up both the server and the UI. This works properly: I can log runs from another machine and see them appear in the UI, great.

However, now I want to secure the connection to the remote tracking server using SSL, as described here. I've created a self-signed key and certificate file using openssl, as described here.

Whenever I bring up the server using this command, everything seems to be in working order and I do not get any errors:

aim server --repo ~/mycontainer/aim/ --ssl-keyfile ~/secrets/server.key --ssl-certfile ~/secrets/server.crt --host 0.0.0.0 --dev --port 53800

But then when I try to log a run from another machine, I get the following error on the client:

azureuser@ml-ci-jvranken-prd:~/cloudfiles/code/Users/jvranken/aim-tracking-server$ python aim_test.py 
Failed to connect to Aim Server. Have you forgot to run `aim server` command?
Traceback (most recent call last):
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/transport/utils.py", line 14, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/transport/client.py", line 138, in connect
    response = requests.get(endpoint, headers=self.request_headers)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/adapters.py", line 682, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/ml-ci-jvranken-prd/code/Users/jvranken/aim-tracking-server/aim_test.py", line 7, in <module>
    run = Run(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/run.py", line 859, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/run.py", line 272, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/base_run.py", line 34, in __init__
    self.repo = get_repo(repo)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 26, in get_repo
    repo = Repo.from_path(repo)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/repo.py", line 210, in from_path
    repo = Repo(path, read_only=read_only, init=init)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/repo.py", line 121, in __init__
    self._client = Client(remote_path)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/transport/client.py", line 50, in __init__
    self.connect()
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/transport/utils.py", line 18, in wrapper
    raise RuntimeError(error_message)
RuntimeError: Failed to connect to Aim Server. Have you forgot to run `aim server` command?

Do you have any clue as to why this is not working? Here is the docker-compose.yaml and the python file I'm using:

services:
  ui:
    image: aimstack/aim:3.20.1
    container_name: aim_ui
    restart: unless-stopped
    command: up --host 0.0.0.0 --port 43800 --dev
    ports:
      - 80:43800
    volumes:
    - ~/mycontainer/aim:/opt/aim
    networks:
      - aim

  server:
    image: aimstack/aim:3.20.1
    container_name: aim_server
    restart: unless-stopped
    command: server --host 0.0.0.0 --dev --ssl-keyfile /opt/secrets/server.key --ssl-certfile /opt/secrets/server.crt
    ports:
      - 53800:53800
    volumes:
    - ~/mycontainer/aim:/opt/aim
    - ~/secrets:/opt/secrets
    networks:
      - aim

networks:
  aim:
    driver: bridge

from aim import Run

# AIM_REPO='/home/azureuser/mycontainer/aim'
AIM_REPO='aim://REDACTED:53800'
AIM_EXPERIMENT='SSL-server'

run = Run(
    repo=AIM_REPO,
    experiment=AIM_EXPERIMENT
)


hparams_dict = {
    'learning_rate': 0.001,
    'batch_size': 32,
}
run['hparams'] = hparams_dict


# log metric
for i in range(30):
    if i % 5 == 0:
        i = i * 0.347
    run.track(float(i), name='numbers')

JeroenVranken avatar Jun 19 '24 13:06 JeroenVranken

@JeroenVranken thanks for the issue. This could be related to the auth token things we have added recently. @mihran113 @alberttorosyan what do you guys think?

SGevorg avatar Jun 21 '24 07:06 SGevorg

Any update on this? I think I ran into a similar issue in #3206.

schauaib avatar Aug 09 '24 11:08 schauaib

This error occurs with version 3.20.1, but everything works fine when I revert to the 3.17.4 version of AIM

schauaib avatar Aug 09 '24 12:08 schauaib

> This error occurs with version 3.20.1, but everything works fine when I revert to the 3.17.4 version of AIM

Have you tried the latest version, 3.23.0? I seem to be dealing with the same issue.

erikdao avatar Aug 27 '24 14:08 erikdao

@erikdao did you manage to resolve it? It seems in general Aim docs could be improved for making it into production. The docker-compose file is missing even in the repo, and there are a few docker / ssl related issues open for a long time..

merryHunter avatar Sep 22 '24 02:09 merryHunter

> @erikdao did you manage to resolve it? It seems in general Aim docs could be improved for making it into production. The docker-compose file is missing even in the repo, and there are a few docker / ssl related issues open for a long time..

It turned out that my problem was different. I didn't enable SSL when deploying Aim Server. My problem was related to networking on GCP.

erikdao avatar Sep 22 '24 12:09 erikdao

Hey folks! Sorry for the late response.

I've opened a PR which will add support for self-signed SSL certificates. The problem here was that, by default, the requests package doesn't trust self-signed certificates and needs a custom cert file path to verify against. This breaks our protocol probe logic (which chooses between http and https for the client): the probe falls back to http, which results in the errors shared above.
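To illustrate the fallback described above, here is a minimal sketch of a protocol probe. It is not Aim's actual client code: the function name `probe_scheme`, its parameters, and the use of the standard library's `urllib` instead of `requests` are all assumptions made to keep the example self-contained.

```python
import ssl
import urllib.error
import urllib.request


def probe_scheme(host_port, opener=urllib.request.urlopen, context=None, timeout=5):
    """Return "https" if an HTTPS request to host_port succeeds, else "http".

    Hypothetical illustration only; `opener` is injectable so the logic can
    be exercised without a real server.
    """
    if context is None:
        # The default context trusts only the system CA store, so a
        # self-signed server certificate fails verification here.
        context = ssl.create_default_context()
    try:
        opener(f"https://{host_port}/", context=context, timeout=timeout)
        return "https"
    except urllib.error.HTTPError:
        # An HTTP error status still means the TLS handshake succeeded.
        return "https"
    except OSError:
        # Covers certificate verification failures and connection errors:
        # the probe gives up on HTTPS and falls back to plain HTTP.
        return "http"
```

With a self-signed certificate and the default trust store, a probe like this would return "http", and the client would then send plain HTTP to a port that expects TLS, which is consistent with the connection being closed without a response as in the traceback.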

This addition will allow specifying the cert file path via an environment variable, which will let the client flow work as expected.
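As a rough sketch of the general idea (not the Aim implementation), a client can read a certificate path from an environment variable and trust it when verifying the server. The variable name `EXAMPLE_CLIENT_CERT_FILE` and the helper `client_ssl_context` are hypothetical; the real variable name is given in the Aim documentation.

```python
import os
import ssl

# Hypothetical variable name, used only for this illustration.
CERT_ENV_VAR = "EXAMPLE_CLIENT_CERT_FILE"


def client_ssl_context(env=os.environ):
    """Build a verifying SSL context, trusting an extra cert file if configured."""
    cert_path = env.get(CERT_ENV_VAR)
    # With cafile=None the system CA store is used; with a path, the
    # certificates in that file (e.g. a self-signed server.crt) are trusted.
    return ssl.create_default_context(cafile=cert_path or None)
```

Verification stays enabled in both cases; only the set of trusted certificates changes, which is generally preferable to turning verification off.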

The changes will be included in the upcoming 3.25.0 release.

mihran113 avatar Sep 23 '24 14:09 mihran113

@mihran113 amazing, thank you very much!!

merryHunter avatar Sep 23 '24 15:09 merryHunter

Hey folks, the changes for this have been published in the newest release of Aim, v3.25.0. Please check out the new section in the documentation for client-side setup (it works the same way as in versions 3.17.5 and older): https://aimstack.readthedocs.io/en/latest/using/remote_tracking.html#ssl-support

Let me know if everything works as expected or not 🙌

mihran113 avatar Oct 02 '24 17:10 mihran113