
[Feature request] add option to force storing of script code when running from inside a git repository

Open markus289 opened this issue 3 years ago • 30 comments

Currently, clearml has two approaches to get code to workers.

  1. If a git repository is detected then the worker clones this repository.
  2. Otherwise, the script code itself is stored.

This is a problem if git is only used locally, i.e., there is no remote associated with the local git repository in which the code resides. In that case, when cloning the experiment and running it on a worker, the following error occurs.

cloning: origin
Repository cloning failed: 'NoneType' object has no attribute 'startswith'
clearml_agent: ERROR: Failed cloning repository.

If, on the other hand, the script being executed is not part of a git repository, clearml stores the script code itself.

2021-04-27 09:45:14,196 - clearml.Task - INFO - No repository found, storing script code instead

Now, it would be nice to have control over this decision, i.e., to be able to store the script code even if a git repository is detected.

As was pointed out to me on Slack, it is possible to disable the repository autodetection.

task = Task.init(project_name="Your project name", task_name="Your task name",
                 auto_connect_frameworks={"detect_repository": False})

But then, the script is not stored.

markus289 avatar Apr 27 '21 09:04 markus289

Hi @markus289 ,

How do you think this feature should be activated? As part of auto_connect_frameworks (which is already quite large), or something like the offline mode (Task.set_offline())?

JDennisJ avatar Apr 27 '21 16:04 JDennisJ

To be honest, I have only been using clearml since yesterday, and my goal is to execute tasks on a set of remote machines. It was really difficult for me to figure out how my code even reaches the remote machines; it took trial and error to work out what happens. I would have expected a discussion of this topic (how code reaches the remote machines) in the Fundamentals section of the documentation.

In other words, I don't think I am qualified to suggest how this feature would fit into clearml.

So, what I would have expected is this: clearml ignores the fact that my code lives in a git repository and creates a snapshot of the code to be submitted to the remote machine. I would also have assumed that I could take a look at said snapshot of the code in the web interface. In my opinion this is important for reproducing results, but may not be the primary goal of clearml.

To elaborate on the use case mentioned above (having the git repository only locally): this is not the complete picture. Currently, clearml always fails when the remote machine cannot reach the git repository. In our case, the remote machines are set up by another person who cannot add my ssh keys, so clearml cannot check out the git repositories on the remote machine via SSH. A way forward would be to let clearml generate ssh keys for each project (?), which could then be added (with read access) to the git repository (GitHub/GitLab/whatever).

In any case, I am wondering why I seem to be the first person with these problems.

markus289 avatar Apr 27 '21 17:04 markus289

I would also like to make a suggestion in this regard. ClearML does not seem to automatically pick up new files if they have not been added to Git (even when the project is a git repository). I would like an option to make ClearML pick up these new files, through a command option, even if they are not part of the Git repo. For example, this could be really useful for local pretrained model weights.

Another suggestion would be a config option to ignore Git completely and just store the entire project folder for the experiment. For example, when launching a new experiment I'm using the following script:

clearml-task \
--project FoodDetection \
--name "Test" \
--docker nvcr.io/nvidia/pytorch:20.12-py3 \
--docker_bash_setup_script docker_setup_script.sh \
--docker_args="--shm-size=8g" \
--folder Swin-Transformer-Object-Detection \
--requirements requirements/build.txt \
--script tools/train.py \
--args config=configs/swin/scnet_swin_eren.py \
--queue default

In this command, I'm already telling ClearML which folder I want to sync (--folder Swin-Transformer-Object-Detection). I would suggest an additional argument in this case, for example --no_git, that would disable the git system entirely and just store everything under the project folder.

The 1st suggestion is useful when we want to run a new experiment whose uncommitted changes include a new file; pretrained model weights are better handled by treating them as an extra dataset.

The 2nd suggestion would help a lot because it means we do not have to add Git credentials to the clearml-agents either. This could be beneficial for new teams trying out ClearML (it makes integration more straightforward: we just give it a folder and a python file to run, and it runs without having to deal with all the Git stuff), and it could be useful for big companies with many independent teams, more complex Git setups, and no single Git account with access to all the repositories.

ErenBalatkan avatar Jun 21 '21 16:06 ErenBalatkan

Another suggestion would be a config option to ignore Git completely and just store the entire project folder for the experiment.

@markus289 and @ErenBalatkan unfortunately I'm not actually sure we have a "place" to store the entire folder snapshot. This basically means zipping the folder, uploading it, and making sure the agent can pull & unzip the file...

I would like an option to make ClearML pick up these new files, through a command option, even if they are not part of the Git repo.

That might be possible. The reason clearml does not pick them up is that they were not added to the git repo, so the git diff will not include them. Any thoughts on how to make git diff include uncommitted (not-added) files, without actually creating a local commit?

bmartinn avatar Jun 21 '21 22:06 bmartinn
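
One possible angle on the question above about git diff, sketched purely as an illustration of standard git behavior (this is not something ClearML does automatically): git's "intent-to-add" flag makes untracked files show up in git diff without staging their content or creating a commit.

# Illustration only: standard git behavior, not an existing ClearML feature.
import subprocess

def diff_including_untracked(new_files):
    # Mark the untracked files with intent-to-add (reversible via "git reset").
    subprocess.run(["git", "add", "--intent-to-add", *new_files], check=True)
    # The working-tree diff now also contains the new files as additions.
    return subprocess.run(["git", "diff"], capture_output=True, text=True, check=True).stdout

print(diff_including_untracked(["new_local_file.py"]))  # hypothetical untracked file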

Wouldn't it be possible to treat project folders similar to the way datasets work?

I'm not sure about how git diff part could be handled.

ErenBalatkan avatar Jun 23 '21 13:06 ErenBalatkan

Wouldn't it be possible to treat project folders similar to the way datasets work?

Yes it would, but it also means we need to add support for the clearml-agent

I'm not sure about how git diff part could be handled.

It might be possible to combine the two and just pack everything that is not in the git repo into a zip file, then have the clearml-agent unpack everything; totally doable... Is this something users would want to have? Don't we want to advocate for git? Guys, wdyt?

bmartinn avatar Jun 25 '21 22:06 bmartinn

Yeah, Git support is already pretty good from what I have tested so far, but the ability to run projects that are not connected to Git would be a huge plus for us.

Also, another idea came to mind: would it be possible to have Git credential keys per experiment instead of per agent? This would mean that we would not need one account with access to every repository we use for ML experiments and could instead supply credentials from our own accounts per experiment. I can imagine this would be a huge time saver for teams that frequently experiment with different repos, and for companies with a high number of independent teams.

ErenBalatkan avatar Jun 28 '21 13:06 ErenBalatkan

... but the ability to run projects that are not connected to Git would be a huge plus for us.

I think this is doable. Would zipping the entire (really the entire, no filtering added) working directory work? How would one specify the "root folder"? (e.g. when the current execution directory is not the root folder of the project)

Also, another idea came to mind: would it be possible to have Git credential keys per experiment instead of per agent? ...

Yep, I'm pretty sure the paid version supports this kind of vault :) but for the open-source version this still falls under the BYOS (bring your own security) approach, which makes things a lot easier for me :)

bmartinn avatar Jun 30 '21 22:06 bmartinn

I think this is doable. Would zipping the entire (really the entire, no filtering added) working directory work? How would one specify the "root folder"? (e.g. when the current execution directory is not the root folder of the project)

I think the best approach would be to add another argument for the root folder; this path would be relative to the execution directory, and if not provided, the execution directory could be the root instead.

I think zipping the entire project would work; it is not the most optimized approach, but it will get the job done. A very simple gitignore-like filter could be useful for teams that may only need one pretrained model from a directory filled with pretrained models. But we rely on a clearml-data based approach for pretrained models, so that is not a problem in our case.

ErenBalatkan avatar Jul 01 '21 13:07 ErenBalatkan

Following the request in #395 for more specific control over the data that will be packaged, we could try something like:

task.package_repo(root_folder='.', include_files=['*.py', '*.txt'], exclude_files=['*.pyc'])

We could also introduce a way to make this the default for any non-git execution via the clearml.conf configuration file:

sdk.development.package_non_git.include_files: ['*.py', '*.txt']
sdk.development.package_non_git.exclude_files: ['*.pyc']

@idantene I'd rather have a more generic interface to control additional files than an additional argument on execute_remotely; I think it is less confusing (though I might be off here).

@ErenBalatkan wdyt ?

BTW: if you have better naming for the functions / parameters feel free to suggest :)

This might be a bit more challenging to implement on the agent side as by default it has no storage drivers, but let's see what we can do when we get there.

bmartinn avatar Jul 07 '21 21:07 bmartinn
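
To make the filtering semantics of the proposal concrete, here is a rough sketch of what the include/exclude logic behind a hypothetical task.package_repo() could look like. Nothing below is an existing ClearML API; it only illustrates the suggested behavior.

# Hypothetical sketch of the proposed packaging semantics; package_repo() does not exist in ClearML.
import fnmatch
import zipfile
from pathlib import Path

def package_repo(root_folder=".", include_files=("*.py", "*.txt"),
                 exclude_files=("*.pyc",), output="repo_package.zip"):
    # Zip every file under root_folder that matches an include glob and no exclude glob.
    root = Path(root_folder).resolve()
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in root.rglob("*"):
            if not path.is_file():
                continue
            if not any(fnmatch.fnmatch(path.name, pat) for pat in include_files):
                continue
            if any(fnmatch.fnmatch(path.name, pat) for pat in exclude_files):
                continue
            zf.write(path, path.relative_to(root))  # keep paths relative to the root folder
    return output

package_repo(root_folder=".", include_files=["*.py", "*.txt"], exclude_files=["*.pyc"])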

Looks perfect to me. This would give us a ton more flexibility with experiments.

ErenBalatkan avatar Jul 08 '21 06:07 ErenBalatkan

The task.package_repo name is a bit ambiguous IMO. Will packaged items appear on the experiment page, as if they're uploaded using task.upload_artifact? What if I just wanted to ship these off for remote execution but not log them to the experiment? I think this repository/package notion is in any case more tightly coupled with execute_remotely specifically (since users can call task.upload_artifact or StorageManager.upload_folder, but these do not offer the "out of the box" solution for remote execution), so I don't see why it should be a separate call -- unless it serves other functions I may not be aware of :)

I think it would be better anyway to keep the packaged repositories as a dictionary, so that (continuing the above example) we'd have:

task.add_repo(name, root_folder=..., include_files=..., exclude_files=...)
task.list_repos()
task.remove_repo(name)
task.update_repo(name, root_folder=..., include_files=..., exclude_files=...)

This also fits the notion of the package_data I've linked to in #395 (in setuptools, here).

idantene avatar Jul 08 '21 09:07 idantene

... as if they're uploaded using task.upload_artifact?

Basically, the way I imagine it: we zip everything (source code, data, etc.), then put it as an artifact on the Task. The agent will then unzip the entire thing before running the actual code (this allows storing both the data and the code).

What if I just wanted to ship these off for remote execution but not log them to the experiment?

Not sure I understand how you can ship it to a remote machine but not log it on the Task? And actually, why wouldn't you want to log it on the Task?

I like add_repo; maybe add_local_repo, so it is obvious this is not supposed to be a git repository? (I wouldn't start with update/remove; I can't see the immediate use case for adding and then modifying it later, unless you have something specific in mind here)

bmartinn avatar Jul 09 '21 00:07 bmartinn
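
As a rough sketch of that flow, under the assumption that the snapshot is shipped as a regular Task artifact (the artifact name, paths, and project/task names below are made up; upload_artifact(), Task.get_task(), and Artifact.get_local_copy() are the existing ClearML calls used):

# Local side: zip the working directory and attach it to the task as an artifact.
import shutil
from clearml import Task

task = Task.init(project_name="examples", task_name="workdir snapshot sketch")  # assumed names
archive = shutil.make_archive("/tmp/workdir_snapshot", "zip", root_dir=".")     # zip everything, no filtering
task.upload_artifact(name="workdir_snapshot", artifact_object=archive)

# Remote side (conceptually what the agent, or the remote script, would do):
# fetch the artifact and unpack it before running the actual code.
remote_task = Task.get_task(task_id="<task id>")  # placeholder id
local_copy = remote_task.artifacts["workdir_snapshot"].get_local_copy()
# Depending on the clearml version, get_local_copy() may already extract archives;
# if it returns the zip itself, unpack it manually.
if local_copy.endswith(".zip"):
    shutil.unpack_archive(local_copy, "restored_workdir")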

From my perspective, stored code must be logged to the ClearML task to ensure that every experiment is reproducible. Currently, saving the Git commit already ensures this, but for this alternative approach that does not use Git, the code has to be saved on the Task so that reproducibility is not broken.

ErenBalatkan avatar Jul 09 '21 06:07 ErenBalatkan

(bump) Any updates on this? :)

Not sure I understand how you can ship it to a remote machine but not log it on the Task? And actually, why wouldn't you want to log it on the Task?

@bmartinn - sorry, I only now noticed this. We have a legacy system with ground-truth CSV files that we haven't yet transitioned to clearml-data (we version them using DVC), so there's no need to log them again, multiple times, to each task that uses them. My workaround thus far has been to use the StorageManager to make them available for remote tasks.

idantene avatar Dec 30 '21 15:12 idantene

Hi @idantene ,

It's still on our TODO list - I'll try to push it forward 🙂

jkhenning avatar Dec 31 '21 10:12 jkhenning

Hi @idantene, the latest RC 1.1.6rc0 contains this commit, which allows you to call Task.force_store_standalone_script() to force storing the standalone script instead of using the git repository reference.

jkhenning avatar Jan 12 '22 07:01 jkhenning
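
For reference, a minimal usage sketch (assuming, per the linked commit, that the call is made before Task.init(); the project/task names are placeholders):

from clearml import Task

# Force clearml to store the executing script itself, even though the code
# lives inside a git repository (call before Task.init()).
Task.force_store_standalone_script()

task = Task.init(project_name="Your project name", task_name="Your task name")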

Hey @jkhenning! Thanks for the notification; my original request was actually a bit different from this (see #395). We consolidated discussions here as @bmartinn suggested :)

tldr - my original feature request was to package additional files with the task:)

idantene avatar Jan 12 '22 07:01 idantene

@jkhenning @bmartinn should the discussion continue in #395, now that the fix for this issue is merged in 1.1.6?

idantene avatar Jan 26 '22 16:01 idantene

tldr - my original feature request was to package additional files with the task:)

Hmm yes, for that one we need to make sure we have a proper interface.

  1. Storing the additional files as an artifact (e.g. zipping them and uploading them as an artifact; this is already fully supported).
  2. Making sure the clearml-agent is aware of these "special" artifacts and unzips the content of the artifact into the target repo dir.

Am I missing something here?

Main challenge: the clearml-agent does not actually have any "download" capabilities. For example, assuming the artifact is stored on an S3 bucket, the agent would need boto3 preinstalled (which is not in the requirements list, and we do try to keep it minimal).

Thoughts ?

On a side note, why not use git for that? It was designed for exactly these things, no?! Am I missing something in the use case?

bmartinn avatar Jan 28 '22 22:01 bmartinn

Am I missing something here?

I don't think you are :) I'm not aware of how things work right now, but I assume the agent downloads the requirements list (from pip freeze), git changes, etc., from somewhere?

Main challenge: the clearml-agent does not actually have any "download" capabilities. For example, assuming the artifact is stored on an S3 bucket, the agent would need boto3 preinstalled (which is not in the requirements list, and we do try to keep it minimal).

Thoughts ?

Adding boto3 as a requirement seems fine to me (doesn't it have it anyway? Otherwise, why/how even provide S3/Azure/Google credentials in the agent's clearml.conf?). As mentioned, there are already some preparatory steps in the agent where it at least seems to pull some data from the Task object; I imagine this would work similarly.

On a side note, why not use git for that, it was designed for exactly these things, no?! Am I missing something from the use case ?

No, git was not designed for these things :) These are not code objects we ship, but rather CLI arguments that refer to various files (i.e. config files, local pre-existing models, etc.). They are not meant for versioning, just for that specific run (and hence its replication). They are not suitable to be a configuration object either, since they have some 700+ lines on average.

EDIT: So for example, assuming the clearml-agent's and the local clearml.conf both point to the same api.files_server and have the required credentials if needed (S3, Azure, etc.), these files could be temporarily uploaded to a cache keyed by the task id on the file server, and the agent would download them before starting the run (for example, after pip install).

idantene avatar Jan 29 '22 07:01 idantene

They are not suitable to be a configuration object either, since they have some 700+ lines on average.

There is no real size limitation here; I think the actual limit is on the entire Task object itself, which is about 16MB. This means 700+ lines should not be a problem.

These are not code objects we ship, but rather CLI arguments that refer to various files (i.e. config files, or local pre-existing models, etc).

Oh, that makes sense, so you have git + config + models? Would it make sense to use the model repository to upload/download those files? More specifically, the agent doesn't have any "download" capabilities; it uses git, pip, etc. to install packages and get the code base (we try to keep the agent as "light" as possible, so it can be executed in any environment). clearml, on the other hand, does have download capabilities and can be used to access model files / artifacts; this means that inside your code you download the needed artifacts, as opposed to the agent doing it pre-execution. Maybe use clearml-data to upload the additional files? How does your specific use case work? (meaning: what is the flow/order of steps you are performing)

bmartinn avatar Jan 31 '22 22:01 bmartinn
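
A minimal sketch of the clearml-data route suggested here, using the standard Dataset API (the dataset/project names and file paths are placeholders):

from clearml import Dataset

# One-off: upload the extra files (configs, pretrained weights, ...) as a dataset.
ds = Dataset.create(dataset_name="extra-run-files", dataset_project="FoodDetection")  # assumed names
ds.add_files("configs/")               # any local folder or file
ds.add_files("pretrained/model.pth")
ds.upload()
ds.finalize()

# Inside the training code (works the same locally and under the agent):
extra_files_dir = Dataset.get(dataset_project="FoodDetection", dataset_name="extra-run-files").get_local_copy()
print("extra files available under:", extra_files_dir)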

@bmartinn, sorry for being unclear. The "limit" here is not one of size, but of readability. The configuration tab fits perfectly for small, flat dictionary structures (some nesting is fine). With 700+ lines and multiple levels of nesting, using it to store a full config file makes the file unintelligible and removes most, if not all, readability.

The workaround we have now is given below, and I believe it could be implemented within clearml-agent directly. As for keeping it light, I don't think boto3 and friends are a large requirement to add to Python code - do you? It could even be a dynamic requirement, based on the agent's configuration file.

We first determine a cache location based on a project+task name either given in the configuration file (when running locally), or from the remote task metadata (when running remotely). In our case this is a MinIO S3 bucket. We then had to add some pre_init() and post_init() methods to our code, so it roughly looks like this (pseudo-code):

pre_init()
# set up argument parser
# load dotenv file, verify config file integrity
post_init()
# normal run as expected

Then, the pre_init() has:

if Task.running_locally():
    return
# Remote execution
# Set basic environment variables from remote task
# Download cache directory locally
# Clear cache directory (using boto3) and obfuscate remote basic environment variables

And the post_init():

if not Task.running_locally():
    return
# Prepare for remote execution
task = Task.init(...)
# Set basic environment variables in task
# upload configuration file(s), model file(s), etc as needed to cache using StorageManager
# Switch to remote execution

We also included a small cleanup for the remote execution that then clears the agent's local storage.

I can probably share non-pseudo-code bits if this is insufficient.

idantene avatar Feb 01 '22 07:02 idantene
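
For completeness, a condensed sketch of what the upload/download halves of this workaround might look like with StorageManager (the bucket URL and file names are made up; only StorageManager.upload_file() and StorageManager.get_local_copy() are existing calls):

from clearml import StorageManager

CACHE_BASE = "s3://my-minio/clearml-cache"  # assumed bucket, credentials configured in clearml.conf

def upload_to_cache(task, config_path="config.yaml"):
    # Local side (post_init): push run-specific files to a cache keyed by the task id.
    remote_url = "{}/{}/{}".format(CACHE_BASE, task.id, config_path)
    StorageManager.upload_file(local_file=config_path, remote_url=remote_url)
    return remote_url

def download_from_cache(task, config_path="config.yaml"):
    # Remote side (pre_init): pull the cached files back before the normal run starts.
    remote_url = "{}/{}/{}".format(CACHE_BASE, task.id, config_path)
    return StorageManager.get_local_copy(remote_url)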

@idantene Wondering about where you say "...using it to store a full config file makes the file unintelligible and removes most, if not all, readability" - In which way does readability come into play? From the discussion so far, it sounds like the main driver is programmatic access to the document(s), and with the interface being considered (artifacts) there's no extra readability maintained over configuration objects?

ainoam avatar Feb 01 '22 16:02 ainoam

@ainoam Readability comes into play in debugging and understanding a Task's inner workings.

idantene avatar Feb 01 '22 16:02 idantene

@idantene Wholly agree. But how does an artifacts interface improve readability on the configuration objects one?

ainoam avatar Feb 01 '22 16:02 ainoam

@ainoam the preview window over the configurable UI, essentially.

idantene avatar Feb 01 '22 17:02 idantene

Hey @bmartinn @ainoam

Any updates on this? 🤔

idantene avatar Jul 14 '22 07:07 idantene

Thanks for pinging, @idantene! I think this issue has kind of expanded from the original request (which is already implemented). Specifically, I "think" you are asking about:

  • Storing the additional files as an artifact (e.g. zipping them and uploading them as an artifact; this is already fully supported)
  • Making sure the clearml-agent is aware of these "special" artifacts and unzips the content of the artifact into the target repo dir

The main issue with storing additional files on the Task is cloning Tasks. Basically, if everything is stored on the Task object, then creating a copy of the Task object copies everything we need to run it. But if we add additional files as artifacts, the links to the artifacts are not cloned when cloning the Task (because artifacts are usually the output of the Task execution, not its input). Now, there is some support for artifacts of type "input" (at least in theory), and we might be able to use this kind of artifact and, when cloning Tasks, also clone the input artifact. But the caveat is that the link will point to the original Task, and this Task might get deleted, which would break any task cloned from it. wdyt?

bmartinn avatar Jul 14 '22 22:07 bmartinn

@bmartinn

But the caveat is that the link will point to the original Task, and this Task might get deleted, which would break any task cloned from it.

If that's the case, then it is the user's fault, not ClearML's (or offer a "copy input files" checkbox when cloning a task, causing them to be replicated). Anyway, this sounds great. We have so many different workarounds for this now...

idantene avatar Sep 08 '22 14:09 idantene