clearml-session
Agent fails to install SSH server when running in venv/Conda
I've followed fairly straightforward steps to install a ClearML agent and connect to it using clearml-session, but get the following output:
Installing SSH Server on ip-172-31-4-42 [172.31.4.42]
Unable to load host key "/home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_rsa_key.pub": invalid format
Unable to load host key: /home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_rsa_key.pub
Unable to load host key "/home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_ecdsa_key.pub": invalid format
Unable to load host key: /home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_ecdsa_key.pub
Unable to load host key: /home/ubuntu/.clearml/venvs-builds/3.8/code/.clearml_session_sshd/ssh_host_ed25519_key.pub
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
On the client side I then get:
Password: Error: incorrect password
Please enter password manually:
Any suggestions? Would the recommendation be to install/run the ClearML agent as root and/or using the system Python?
Steps to reproduce
On the agent:
# System: Ubuntu Focal 20.04, AMD64
# Install Miniconda, then
conda create -n clearml python=3.8
pip install clearml-agent
clearml-agent init
# Copy/paste credentials obtained from ClearML server
clearml-agent daemon --queue default --foreground
Then on the client:
clearml-session --public_ip true
# {
# "base_task_id": null,
# "git_credentials": false,
# "jupyter_lab": true,
# "password": "<long random-looking password>",
# "public_ip": true,
# "queue": "default",
# "vscode_server": true
#}
Hi @norrishd ,
Thanks for the details - I'll try to reproduce and update as soon as possible!
Hi @norrishd The agent has no permissions to install the SSH server when running inside venv/conda. I'm not sure how we can support it without having root access for it. If an SSH daemon is already installed, it should be able to spin a second copy of it. wdyt?
Thanks for the explanation @bmartinn! So do you mean that the current version is able to spin a second SSH daemon? (assuming there's an SSH daemon installed). If so that's very cool and would be fine (I must just be doing something wrong)
I tend to use venvs for everything just to avoid ever messing with the system Python. But I guess the use case for clearml-agent is that it's intended to run on servers (or in containers?) that are reserved for that purpose, so the recommendation is to use the system Python and install necessary packages there?
Would you also recommend running it as sudo, or does it not need that level of privileges?
If so that's very cool and would be fine (I must just be doing something wrong)
Yes, at least in theory (if this doesn't work even though /usr/sbin/sshd is preinstalled, let me know what the setup is; we might be missing something).
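A quick way to gather the setup details asked for here is to check whether an sshd binary is present and which user the agent runs as. This is a minimal sketch: check_sshd is a hypothetical helper, not part of clearml-agent or clearml-session, and the default path is simply the one mentioned above.

```shell
# Hypothetical helper (not part of ClearML): report whether an executable
# sshd binary exists at the given path (default: /usr/sbin/sshd).
check_sshd() {
    sshd_path="${1:-/usr/sbin/sshd}"
    if [ -x "$sshd_path" ]; then
        echo "sshd found at $sshd_path"
    else
        echo "sshd missing at $sshd_path"
    fi
}

# Check the default location
check_sshd

# Effective user ID of the current shell; 0 means root
id -u
```

Running this in the same shell (and as the same user) that starts the agent shows whether the precondition for spinning up a second SSH daemon is met.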
... But I guess the use case for clearml-agent is that it's intended to run on servers
The agent itself can be installed in a venv (even though it might be easier to install it system-wide).
The issue is the process the agent spins up, i.e. when the agent gets a job (a Task), it can either create a new temporary venv for the Task, install everything the Task needs there, spin up the process and leave, or it can spin up a container for the Task and then repeat the same process (venv creation) inside the container.
When the agent is used to spin up clearml-session, the usual setup is that the agent is running in docker mode (i.e. with the flag --docker); it then spins up all jobs inside a container, including clearml-session's remote interactive session.
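As a sketch of what that looks like for the reproduction steps in this thread, the daemon command from the original report can be extended with --docker. The helper name agent_docker_cmd is hypothetical; it only prints the command and uses only flags already mentioned in this thread. It assumes Docker is installed on the agent machine.

```shell
# Hypothetical helper: print the agent command line for docker mode.
# Only flags that already appear in this thread are used.
agent_docker_cmd() {
    queue="${1:-default}"
    printf 'clearml-agent daemon --queue %s --docker --foreground\n' "$queue"
}

# Print the command for the "default" queue used in the report
agent_docker_cmd default
```

Printing the command rather than running it keeps the sketch safe to try on a machine without the agent installed; to start the agent, run the printed line directly.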
Make sense?
Yep makes total sense, thanks 😁