clearml-agent
Does trains-agent cache experiment envs?
Context
Hi,
Most of the time (99%), we send tasks to trains-agent with changes in code, but no changes in requirements (the environment does not change). We would expect that the environment (venv) is cached and reused between different experiments, to spare us the installation time (5-10 mins), so that we can iterate faster.
Problem
- I tried to run the artifact_toy example locally.
- I then cloned the experiment in the UI, reset it, and sent it to the queue again.
- I waited for the experiment to finish and repeated the previous step.
This way, the task was executed twice on the same trains-agent.
Actual behavior
The logs below show the execution trace of the second run in the agent. As you can see:
- Git repo was successfully cached and reused
- Packages were successfully cached and reused
- The environment, although identical, was not cached and reused: it was reinstalled from scratch.
Expected behavior
Since the task runs a second time with the same environment, I would expect trains-agent to reuse it, saving me the installation time. I would expect trains-agent to compute a hash of the environment from the list of requirements (at task creation, not after the task finishes, so that it can also match the requirements of another draft task) and reuse the same venv whenever a new task has the same hash.
Logs
2020-05-19T07:41:06.320Z instance-2:0 INFO task 6a5045e5c3b74afb892f85986b655218 pulled from 672de23dcf4b456590e150a2d3e3d002 by worker instance-2:0
2020-05-19T07:41:11.394Z instance-2:0 DEBUG Current configuration (trains_agent v0.14.1, location: /tmp/.trains_agent.uzshvg_e.cfg):
----------------------
agent.worker_id = instance-2:0
agent.worker_name = instance-2
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = <21
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = defaults
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = pytorch
agent.venvs_dir = /home/user/.trains/venvs-builds
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/user/.trains/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/user/.trains/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.default_docker.image = nvidia/cuda
agent.git_user =
agent.default_python = 3.7
agent.cuda_version = 0
agent.cudnn_version = 0
sdk.storage.cache.default_base_dir = ~/.trains/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff_on_train = true
sdk.development.support_stopping = true
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
Executing task id [6a5045e5c3b74afb892f85986b655218]:
****
entry_point = artifact_toy.py
working_dir = .
Warning: could not locate requested Python version 3.6, reverting to version 3.7
Using base prefix '/usr'
New python executable in /home/user/.trains/venvs-builds/3.7/bin/python3.7
Also creating executable in /home/user/.trains/venvs-builds/3.7/bin/python
Installing setuptools, pip, wheel...
done.
Using cached repository in "/home/user/.trains/vcs-cache/my-repo.git.32940bc4e1fe7ef7cdafd7e48f8cf5db/my-repo.git"
2020-05-19T07:41:16.437Z instance-2:0 DEBUG Note: checking out '7e3592afc8e0d3cd7b7c02a3672b558aed8c675c'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:
git checkout -b <new-branch-name>
HEAD is now at 7e3592a add toy
type: git
url: https://github.com/H4dr1en/my-repo.git
branch: HEAD
commit: 7e3592afc8e0d3cd7b7c02a3672b558aed8c675c
root: /home/user/.trains/venvs-builds/3.7/task_repository/my-repo.git
Requirement already up-to-date: pip<21 in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (20.1)
Collecting Cython
Using cached Cython-0.29.18-cp37-cp37m-manylinux1_x86_64.whl (2.0 MB)
Installing collected packages: Cython
Successfully installed Cython-0.29.18
Requirement already satisfied: Cython==0.29.18 in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (0.29.18)
2020-05-19T07:41:21.481Z instance-2:0 DEBUG Collecting numpy==1.16.2
Using cached numpy-1.16.2-cp37-cp37m-manylinux1_x86_64.whl (17.3 MB)
Installing collected packages: numpy
Successfully installed numpy-1.16.2
Collecting attrs==19.3.0
Using cached attrs-19.3.0-py2.py3-none-any.whl (39 kB)
Collecting boto3==1.12.39
Using cached boto3-1.12.39-py2.py3-none-any.whl (128 kB)
Collecting botocore==1.15.49
Using cached botocore-1.15.49-py2.py3-none-any.whl (6.2 MB)
2020-05-19T07:41:26.525Z instance-2:0 DEBUG Collecting certifi==2020.4.5.1
Using cached certifi-2020.4.5.1-py2.py3-none-any.whl (157 kB)
Collecting chardet==3.0.4
Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Requirement already satisfied: Cython==0.29.18 in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (from -r /tmp/cached-reqsszq7heni.txt (line 6)) (0.29.18)
Collecting docutils==0.15.2
Using cached docutils-0.15.2-py3-none-any.whl (547 kB)
Collecting funcsigs==1.0.2
Using cached funcsigs-1.0.2-py2.py3-none-any.whl (17 kB)
Collecting furl==2.1.0
Using cached furl-2.1.0-py2.py3-none-any.whl (20 kB)
Processing /home/user/.cache/pip/wheels/8b/99/a0/81daf51dcd359a9377b110a8a886b3895921802d2fc1b2397e/future-0.18.2-cp37-none-any.whl
Collecting humanfriendly==8.2
Using cached humanfriendly-8.2-py2.py3-none-any.whl (86 kB)
Collecting idna==2.9
Using cached idna-2.9-py2.py3-none-any.whl (58 kB)
Collecting importlib-metadata==1.6.0
Using cached importlib_metadata-1.6.0-py2.py3-none-any.whl (30 kB)
Collecting jmespath==0.10.0
Using cached jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting jsonmodels==2.4
Using cached jsonmodels-2.4-py2.py3-none-any.whl (20 kB)
Collecting jsonschema==3.2.0
Using cached jsonschema-3.2.0-py2.py3-none-any.whl (56 kB)
Requirement already satisfied: numpy==1.16.2 in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (from -r /tmp/cached-reqsszq7heni.txt (line 17)) (1.16.2)
Collecting orderedmultidict==1.0.1
Using cached orderedmultidict-1.0.1-py2.py3-none-any.whl (11 kB)
Collecting pandas==1.0.3
Using cached pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0 MB)
Collecting pathlib2==2.3.5
Using cached pathlib2-2.3.5-py2.py3-none-any.whl (18 kB)
Collecting Pillow==6.2.1
Using cached Pillow-6.2.1-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Collecting plotly==4.7.1
Using cached plotly-4.7.1-py2.py3-none-any.whl (11.5 MB)
Processing /home/user/.cache/pip/wheels/b6/e7/50/aee9cc966163d74430f13f208171dee22f11efa4a4a826661c/psutil-5.7.0-cp37-cp37m-linux_x86_64.whl
Collecting PyJWT==1.7.1
Using cached PyJWT-1.7.1-py2.py3-none-any.whl (18 kB)
Collecting pyparsing==2.4.7
Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Processing /home/user/.cache/pip/wheels/22/52/11/f0920f95c23ed7d2d0b05f2b7b2f4509e87a20cfe8ea43d987/pyrsistent-0.16.0-cp37-cp37m-linux_x86_64.whl
Collecting python-dateutil==2.8.1
Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting pytz==2020.1
Using cached pytz-2020.1-py2.py3-none-any.whl (510 kB)
Processing /home/user/.cache/pip/wheels/5e/03/1e/e1e954795d6f35dfc7b637fe2277bff021303bd9570ecea653/PyYAML-5.3.1-cp37-cp37m-linux_x86_64.whl
Collecting requests==2.23.0
Using cached requests-2.23.0-py2.py3-none-any.whl (58 kB)
Collecting requests-file==1.5.1
Using cached requests_file-1.5.1-py2.py3-none-any.whl (3.7 kB)
Processing /home/user/.cache/pip/wheels/d7/a9/33/acc7b709e2a35caa7d4cae442f6fe6fbf2c43f80823d46460c/retrying-1.3.3-cp37-none-any.whl
Collecting s3transfer==0.3.3
Using cached s3transfer-0.3.3-py2.py3-none-any.whl (69 kB)
Collecting six==1.14.0
Using cached six-1.14.0-py2.py3-none-any.whl (10 kB)
Collecting tqdm==4.46.0
Using cached tqdm-4.46.0-py2.py3-none-any.whl (63 kB)
Collecting trains==0.14.3
Using cached trains-0.14.3-py2.py3-none-any.whl (550 kB)
Collecting typing==3.7.4.1
Using cached typing-3.7.4.1-py3-none-any.whl (25 kB)
Collecting urllib3==1.25.9
Using cached urllib3-1.25.9-py2.py3-none-any.whl (126 kB)
Collecting zipp==3.1.0
Using cached zipp-3.1.0-py3-none-any.whl (4.9 kB)
Requirement already satisfied: setuptools in /home/user/.trains/venvs-builds/3.7/lib/python3.7/site-packages (from jsonschema==3.2.0->-r /tmp/cached-reqsszq7heni.txt (line 16)) (46.4.0)
2020-05-19T07:41:31.569Z instance-2:0 DEBUG Installing collected packages: attrs, jmespath, six, python-dateutil, docutils, urllib3, botocore, s3transfer, boto3, certifi, chardet, funcsigs, orderedmultidict, furl, future, humanfriendly, idna, zipp, importlib-metadata, jsonmodels, pyrsistent, jsonschema, pytz, pandas, pathlib2, Pillow, retrying, plotly, psutil, PyJWT, pyparsing, PyYAML, requests, requests-file, tqdm, typing, trains
2020-05-19T07:41:46.654Z instance-2:0 DEBUG Successfully installed Pillow-6.2.1 PyJWT-1.7.1 PyYAML-5.3.1 attrs-19.3.0 boto3-1.12.39 botocore-1.15.49 certifi-2020.4.5.1 chardet-3.0.4 docutils-0.15.2 funcsigs-1.0.2 furl-2.1.0 future-0.18.2 humanfriendly-8.2 idna-2.9 importlib-metadata-1.6.0 jmespath-0.10.0 jsonmodels-2.4 jsonschema-3.2.0 orderedmultidict-1.0.1 pandas-1.0.3 pathlib2-2.3.5 plotly-4.7.1 psutil-5.7.0 pyparsing-2.4.7 pyrsistent-0.16.0 python-dateutil-2.8.1 pytz-2020.1 requests-2.23.0 requests-file-1.5.1 retrying-1.3.3 s3transfer-0.3.3 six-1.14.0 tqdm-4.46.0 trains-0.14.3 typing-3.7.4.1 urllib3-1.25.9 zipp-3.1.0
Running task id [6a5045e5c3b74afb892f85986b655218]:
[.]$ /home/user/.trains/venvs-builds/3.7/bin/python -u artifact_toy.py
Summary - installed python packages:
pip:
- attrs==19.3.0
- boto3==1.12.39
- botocore==1.15.49
- certifi==2020.4.5.1
- chardet==3.0.4
- Cython==0.29.18
- docutils==0.15.2
- funcsigs==1.0.2
- furl==2.1.0
- future==0.18.2
- humanfriendly==8.2
- idna==2.9
- importlib-metadata==1.6.0
- jmespath==0.10.0
- jsonmodels==2.4
- jsonschema==3.2.0
- numpy==1.16.2
- orderedmultidict==1.0.1
- pandas==1.0.3
- pathlib2==2.3.5
- Pillow==6.2.1
- plotly==4.7.1
- psutil==5.7.0
- PyJWT==1.7.1
- pyparsing==2.4.7
- pyrsistent==0.16.0
- python-dateutil==2.8.1
- pytz==2020.1
- PyYAML==5.3.1
- requests==2.23.0
- requests-file==1.5.1
- retrying==1.3.3
- s3transfer==0.3.3
- six==1.14.0
- tqdm==4.46.0
- trains==0.14.3
- typing==3.7.4.1
- urllib3==1.25.9
- zipp==3.1.0
Environment setup completed successfully
Starting Task Execution:
TRAINS results page: http:/a.b.c.d.e:8080/projects/fc7cc6cc167f4763ae35eb27e1bfff2b/experiments/6a5045e5c3b74afb892f85986b655218/output/log
2020-05-19T07:41:51.694Z instance-2:0 DEBUG num_legs num_wings num_specimen_seen
falcon 2 2 10
dog 4 0 2
spider 8 0 1
fish 0 0 8
Done
[train]: shape=(4, 3), 4 unique rows, 100.0% uniqueness
TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
Hi @H4dr1en I can definitely feel you on this one :) We used to use venv_update; in theory you can still try it (but to be honest, I'm not sure of its current status).
Actually we are working on accelerating pip install; in this issue you can see the full potential, and the initial PR.
I'm hoping that after 21.1 is released we will be able to merge all our improvements.
Feel free to join the discussion there :)
The idea is that the safest way to restore an environment is to recreate it (just imagine something goes wrong and the venv gets reused: from time to time something is a bit different, or you think you are getting the same environment, but you are not...).
And since everything is cached, and pip has no real dependencies to resolve (think of the second time, where all the packages are pinned after a pip freeze of the initial venv), there is no reason why the unpacking should take more than a few seconds; after all, these GPU machines are usually fast enough to handle unzipping a few files...
> Actually we are working on accelerating pip install; in this issue you can see the full potential, and the initial PR. I'm hoping that after 21.1 is released we will be able to merge all our improvements. Feel free to join the discussion there :)
Kudos for the great work 🥇 Looks very promising!
> The idea is that the safest way to restore an environment is to recreate it (just imagine something goes wrong and the venv gets reused: from time to time something is a bit different, or you think you are getting the same environment, but you are not...)
This is true in general, but in the specific case where a user wants to rerun an experiment on the same machine, nobody would do that: the user would simply start the experiment again in the same environment. This would be very valuable in trains, because even when all the wheels are cached and no dependencies need to be resolved, installation is still very slow for big libraries like pytorch, opencv, scipy, etc.
We are talking about 5 to 10 mins, even on a powerful machine, to rebuild an environment that was already built on that machine. IMO this is an actual need that should be addressed, because reusing a previous environment shouldn't be difficult to achieve.
Why? Because most of the time researchers have a lot of experiments but only a small number of environments, and it would be very convenient to attach the same environment to multiple experiments, reducing the deployment time to zero. This would be a killer feature.
How I would see it:
Proposal 1
Trains agents take care of everything:
- Do not delete envs after experiments are finished
- Create an internal store of (hash of env, env location)
- If an agent pulls a task whose requirements match a hash in its store, it reuses that env.
- Expose this feature as an agent.cache_envs parameter. Users who know they won't change the environment during their experiments (99% of users) can enable it. (A minimal sketch of this lookup is shown right below.)
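To make this concrete, here is a minimal, hypothetical sketch of the hash-and-lookup idea; the function names (requirements_hash, get_or_create_venv) and the cache location are my own illustration, not an existing trains-agent API:

import hashlib
import os
import subprocess
import venv

# Hypothetical cache location, for illustration only
VENV_CACHE_DIR = os.path.expanduser("~/.trains/venvs-cache")

def requirements_hash(requirements):
    """Hash the normalized, sorted list of pinned requirements (e.g. from pip freeze)."""
    normalized = "\n".join(sorted(line.strip().lower() for line in requirements if line.strip()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def get_or_create_venv(requirements):
    """Return the path of a venv matching these requirements, reusing a cached one if present."""
    env_dir = os.path.join(VENV_CACHE_DIR, requirements_hash(requirements))
    if os.path.isdir(env_dir):
        return env_dir  # cache hit: skip the 5-10 min install entirely
    # Cache miss: build the venv once; later tasks with the same hash will reuse it
    venv.create(env_dir, with_pip=True)
    pip = os.path.join(env_dir, "bin", "pip")  # Linux/macOS layout assumed
    subprocess.check_call([pip, "install"] + list(requirements))
    return env_dir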
Proposal 2
- Decouple environments from experiments: Users can create environments from the web UI/Python API and manage them (create/clone/delete/update list of requirements/package versions, ...)
- Users can link environments to experiments: When creating/editing a task, users can specify which environment they want to use for one experiment (via the unique ID of the environment).
- Keep the flexibility of the current implementation: Environments can be created on-the-fly when creating a task.
- Have programmatic access to these environments. One could do:
my_task = Task(...) # Create task with new env, run vcs detection. Update new env.
my_task = Task(..., environment_id=...) # Create task with already existing environment
Hi @H4dr1en I think that "Proposal 2" is something you can already achieve. This is basically building a docker image and using it as the base docker image.
trains-agent build --docker nvidia/cuda --id aa11bb22 --target my_new_env_docker
This command will take experiment id "aa11bb22" and build a docker image that includes everything installed in it, based on the environment defined in the experiment.
Now you can use the newly created base docker ("my_new_env_docker") as the base docker for all your experiments. Basically what happens is that the environment is installed as the "system" environment inside the docker, and every venv created afterwards inherits its packages. This means everything is preinstalled, but you still have the possibility to change package versions if needed. What do you think?
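For completeness, the prebuilt image can also be attached to an experiment from code; a small sketch, assuming the Task.set_base_docker call available in recent trains/clearml versions, the "my_new_env_docker" image built above, and illustrative project/task names:

from trains import Task

task = Task.init(project_name="examples", task_name="artifact toy")
# Ask an agent running in docker mode to execute this task inside the prebuilt image,
# so the heavy packages are already installed at the system level and the venv only inherits them
task.set_base_docker("my_new_env_docker")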
Regarding "proposal 1": it makes sense only if we hash the environment requirements, and the question is how many venvs we cache. This is doable but might require some work; it also might be a bit more complicated to share the venvs if you are running multiple agents on the same machine. My fear is actually stability: it would be quite bad if from time to time you got the wrong venv, or a venv with leftovers...
Hi, I am on the same bandwagon and tried proposal 2 by setting up my own docker environment. I need this solution specifically because I have to use nvidia-dali for fast pre-processing. However, nvidia-dali has to be installed with the following command:
pip install nvidia-dali==0.21.0 --extra-index-url https://developer.download.nvidia.com/compute/redist/cuda/10.0
However, as mentioned in the trains-agent issues section, pip freeze does not capture --extra-index-url.
I also need to install horovod, which also requires some preliminary steps. I managed to build this docker and run it using the following command:
trains-agent build --docker name_of_docker --id 41672b8... --target trains_docker
It builds and shows up as a worker in the Workers & Queues section, with the following errors:
trains_agent: ERROR: Could not parse task execution info: 'Tasks' object has no attribute 'script'
trains_agent: ERROR: 'NoneType' object has no attribute 'id'
bash: /root/trains.conf: Permission denied
bash: /root/trains.conf: Permission denied
And when I try to enqueue a task I get the following error, naturally:
trains_agent: ERROR: Could not find task id=05d03ebb905840279336ab57f6b69ac8 (for host: )
Exception: 'Tasks' object has no attribute 'id'
I have attached the following log file from the Results section: task_a5df428d97454314b0e56d66f3135fca.log. I am also adding the log file from the agent build step.
Lastly, I am adding the Dockerfile in case someone wants to use it. I learned how to use Docker in a week, so there might also be something going wrong there.
Hi @Mert-Ergin
A few remarks, before answering your question :)
- Did you add the extra_index_url to your ~/trains.conf? As you can see here, we support having multiple indexes for the exact reason you mentioned (a sample configuration is sketched below).
- Horovod is one of the special cases trains-agent takes care of: it will always be installed last, after all the other requirements, because Horovod installs different flavours depending on the pytorch/tensorflow already installed in the system.
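For reference, the agent-side setting looks roughly like this in ~/trains.conf (a sketch based on the default configuration reference; the URL is simply the extra index from the dali command you mentioned):

agent {
  package_manager: {
    # extra index URLs handed to pip, so packages like nvidia-dali can be resolved
    extra_index_url: ["https://developer.download.nvidia.com/compute/redist/cuda/10.0"]
  }
}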
Regarding the error:
- What trains-agent version are you using (both for building the docker and for running it)?
- This error is basically saying there is no Task with the requested ID, which is probably because the agent is missing permissions to your server (and by default it will try the demo server).
- How did you get the error: are you running the docker directly, or using it as the "base docker image" for a specific experiment?
- Just making sure, are you running trains-agent in docker mode?
> Lastly, I am adding the Dockerfile in case someone wants to use it. I learned how to use Docker in a week, so there might also be something going wrong there.
:+1: nice :)
Hi,
I'm updating here that the latest version of clearml-agent now includes venv caching capabilities 🎉 🎊
Add this section to your ~/clearml.conf file on the agent's machine:
agent {
# cached virtual environment folder
venvs_cache: {
# maximum number of cached venvs
max_entries: 10
# minimum required free space to allow for cache entry, disable by passing 0 or negative value
free_space_threshold_gb: 2.0
# unmark to enable virtual environment caching
path: ~/.clearml/venvs-cache
},
}
Reference here: https://github.com/allegroai/clearml-agent/blob/22d5892b12efa2acde304658ad0f08594b3e4ce6/docs/clearml.conf#L93
And upgrade and restart the clearml-agent:
pip install clearml-agent==0.17.2rc2
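After upgrading, restart the agent daemon so it picks up the new configuration, for example (assuming the agent serves a queue named default):

clearml-agent daemon --queue default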
This is awesome, thanks a lot @bmartinn and the team!! I am testing that right away 🤩