clearml-agent
Image on Docker Hub is out of date
I'm just getting started with clearml (learning the ropes). Per the README section describing Kubernetes integration I tried using the image found on Docker Hub, running it outside of k8s: `docker run --gpus all -it --rm -v $HOME/clearml-agent.conf:/clearml.conf -v /var/run/docker.sock:/var/run/docker.sock --network clearml_backend --user root allegroai/clearml-agent`. (`clearml_backend`
comes from https://github.com/allegroai/clearml-server/blob/702b6dc9c804165b192a042253ad1d1690c5f0ed/docker/docker-compose.yml and `clearml-agent.conf` was created by `clearml-agent init`.)
The output of this command is just `CLEARML_AGENT_UPDATE_VERSION =` and the worker does not register. `clearml-agent` appears to be version 0.17.1, FWIW.
Then I noticed that the image was last updated about 3 years ago. Upgrading the `clearml-agent` package using `pip install --upgrade clearml-agent` and bind-mounting the configuration file into the `/root` directory resolved the problem; however, I'm sure there will be plenty of other issues when using such an old base image (e.g., old CUDA).
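For reference, this is roughly what the workaround looked like on my side (a sketch only; the entrypoint override and the `default` queue are my own placeholders, not something from the README):

```bash
# Sketch of the workaround: mount the config where the agent looks for it
# (/root/clearml.conf), upgrade the pip package, then start the daemon.
# The entrypoint override and the "default" queue are placeholders.
docker run --gpus all -it --rm \
  -v $HOME/clearml-agent.conf:/root/clearml.conf \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --network clearml_backend --user root \
  --entrypoint /bin/bash allegroai/clearml-agent \
  -c "pip install --upgrade clearml-agent && clearml-agent daemon --queue default --docker"
```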
I think this might just be a matter of updating the Dockerfile to pin a version of `nvidia/cuda` (the base image) and pushing to Docker Hub.
Hi @dpkirchner, the link you provided does not seem to work - I didn't quite understand which image you used.
My bad, I added an extra backtick in the link: https://github.com/allegroai/clearml-server/blob/702b6dc9c804165b192a042253ad1d1690c5f0ed/docker/docker-compose.yml
The image I used was linked from here: https://github.com/allegroai/clearml-agent/blob/c9fc092f4eea9c3890d582aa2a098c3c2f39ce72/README.md#kubernetes-integration-optional (scroll down to "Spin ClearML-Agent as a long-lasting service pod").
Oh, I see it now. Honestly, I think we should remove this option - it basically spawns tasks as processes inside the agent's pod, which is not a good pattern in k8s. I would recommend using the helm chart instead.
I see, ok. I'll check out the helm chart. Thanks.
It looks like the docker container used by the helm chart is also out of date -- it's running clearml-agent 1.2.4rc3 and using Python 3.6. The image that is closest to being up to date is `allegroai/clearml:1.14.0-431`; however, you'll need to install docker and the `clearml-agent` python package to use it, and it's still a bit out of date.
Through experimentation I've found that if you want to use the latest version, you can check out https://github.com/allegroai/clearml-agent, go to the `docker/agent` directory, edit the `Dockerfile` to replace `FROM nvidia/cuda` with `FROM nvidia/cuda:12.0.0-devel-ubuntu22.04` (you can't use 12.3.1 because of a CUDA-related bug in nvidia's image), and then build the image locally (I'm using `docker build -t clearml-agent:latest .` in the `docker/agent` directory). Following these steps will get you version 1.7.0.
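For anyone who wants to copy-paste, the steps above boil down to something like this (a sketch of my local build; the `sed` one-liner is just one way to apply the `FROM` change):

```bash
# Rough sketch of the local build described above.
git clone https://github.com/allegroai/clearml-agent.git
cd clearml-agent/docker/agent
# Pin the base image; 12.3.1 hit a CUDA-related bug in nvidia's image for me.
sed -i 's|^FROM nvidia/cuda.*|FROM nvidia/cuda:12.0.0-devel-ubuntu22.04|' Dockerfile
docker build -t clearml-agent:latest .
```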
I'm reopening because I'm not sure if this is all intended -- is the `allegroai/clearml-agent` docker image deprecated in general?
(I should note that the `clearml-agent build` command run in this image does not result in a docker image, but I think that's unrelated, and something to be tracked in a different issue.)
Hi @dpkirchner,
The docker image used by the clearml-helm-charts/clearml-agent chart is indeed pretty old (we're supposed to update it soon) and it's the `allegroai/clearml-agent-k8s-base` image. However, it is not related to the `allegroai/clearml-agent` image.
Hi @dpkirchner, do you have any info about the docker image update on Docker Hub? There are a lot of outdated elements in it, like `k8s_glue_example.py` not taking a list of queues, for example.
I cannot find a proper way to build the image even with the `docker` folder from the repository; would it be possible to provide a README for building it locally?
I wasn't able to figure out how to use clearml properly, unfortunately, so I moved on to another project.
@dpkirchner Frankly, I have been hopping between different kinds of MLOps tools, starting with airflow + mlflow, but that stack lacks dataset versioning. So I moved to clearml, and we use k8s (EKS) for most of our ETL pipelines. I deployed clearml-server, which works fine, but now I have tried to deploy clearml-agent in the cluster and it seems to have issues accessing the api server:
clearml_agent.backend_api.session.session.LoginError: Failed getting token (error 401 from https://api.clear.ml): Unauthorized (invalid credentials) (failed to locate provided credentials)
As the clearml documentation is not clear about helm chart deployment, it's really hard to understand the code and do PRs.
@thomsmoreau As far as I can see, there is a `k8s-glue` folder which seems to have various versions of docker images. Based on your cloud you can modify the Dockerfile and update the outdated packages.
Note: during the build you have to modify/add `clearml.conf` with your credentials, as per the Dockerfile.
I am not a fan of putting credentials into the docker image build, but at the same time the helm chart values have an option to pass the credentials as a secret, which is not working right now.
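Just to illustrate the alternative I mean (credentials coming from the environment, e.g. a k8s secret, instead of being baked into the image): a minimal sketch, with variable names as I understand them from the clearml docs and placeholder values.

```bash
# Placeholder values; in a cluster these would come from a k8s secret,
# not from the image build.
export CLEARML_API_HOST=http://<your-api-server>:8008
export CLEARML_API_ACCESS_KEY=<access_key_from_the_clearml_ui>
export CLEARML_API_SECRET_KEY=<secret_key_from_the_clearml_ui>
```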
In terms of passing a list of queues to `k8s_glue_example.py`, you can pass it as 'queue1,queue2'; check the `values.json` of the helm chart and make sure there are no spaces between the strings.
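For illustration, this is the shape of the value that eventually reaches the glue script (the `--queue` argument name is my assumption based on the example script; the key point is a single comma-separated string with no spaces):

```bash
# Single comma-separated string, no spaces between queue names.
python k8s_glue_example.py --queue "queue1,queue2"
```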
@surya9teja I believed that the `k8s_glue_example.py` file in the docker image was up to date, but it is not. The version in the docker image provided by the chart does not take the "," separator in the string (passed via the "queue" argument) into account, so I had to update it manually: first by doing a curl on the raw link you provided (I pulled the chart and changed the templates manually), and then by building a custom image for my company in which I just changed the script, and it works fine! I did that about a month ago.
Since then I haven't checked for updates on the docker images, but I think we could have better results in terms of updated content and performance if the devs could push an update themselves.
Thanks for your message, I should have commented earlier to maybe help other people stuck as I was.
@jkhenning Do you have any info about updating the chart with an up-to-date docker image?