dvc icon indicating copy to clipboard operation
dvc copied to clipboard

`exp run`: copy (certain) git-ignored files to tmp folder on `--run-all`

Open jdonzallaz opened this issue 3 years ago • 39 comments

UPDATE: Jump to https://github.com/iterative/dvc/issues/5800#issuecomment-819368076

When running experiments with --run-all (with some experiments in the --queue), the repo is copied to the .dvc/tmp/ for each experiment to run. The problem is that the git-ignored files are not copied to this folder, and the commands that depend on those files fail. The files that are git-ignored but are defined as stage dependencies should be copied. Use-case is the following: Some config files containing sensible data (API keys, password) are not pushed to the git repo but are still needed by certain commands.

See discord discussion. CC: @shcheklein @pmrowla

jdonzallaz avatar Apr 12 '21 15:04 jdonzallaz

Hey @jdonzallaz per #5029 it's possible to include untracked files in tmp exp dirs by staging them before queueing the experiment. So I think you could try something like

$ git add --force secret.txt
$ dvc exp run --queue
$ git remove secret.txt

⚠️ This would be a workaround! And it may have unwanted consequences like the secrets ending up in Git history.

Let's see what Engineering thinks of this one 🙂

jorgeorpinel avatar Apr 13 '21 02:04 jorgeorpinel

I don't think that workaround is viable here given that the request here is specifically regarding:

Some config files containing sensible data (API keys, password)

And if an experiment is run this way, those API keys and password will end up in git history

pmrowla avatar Apr 13 '21 02:04 pmrowla

Do we need a separate issue to specifically prevent this workaround? B/c if I thought about it someone else may figure it out and it can lead to some hard-to-predict security risks.

jorgeorpinel avatar Apr 13 '21 03:04 jorgeorpinel

There's not really anything we can do to prevent people from doing this, users can add whatever they want into git.

I think anyone using git add --force on a file containing secret credentials should already be aware that this will add that file into git, since that's what git add --force does by definition?

pmrowla avatar Apr 13 '21 03:04 pmrowla

that's what git add --force does by definition?

Not without git commit. In this scenario people would stage the secrets specifically because we copy staged (not committed) files into tmp exp dirs.

I keep coming back to the idea of not using Git as a proxy UI to DVC's behavior, but that's discussed is in #5801.

BTW, from https://github.com/iterative/dvc/issues/5801#issuecomment-818364698 @pmrowla:

in theory, stuff like (properly gitignored) venvs also needs to be copied over

Feels like too much. Is it common to have something like venv/ as a stage dependency? Or are we expanding the scope here (looking ahead)? Thanks

jorgeorpinel avatar Apr 13 '21 04:04 jorgeorpinel

Not without git commit. In this scenario people would stage the secrets specifically because we copy staged (not committed) files into tmp exp dirs.

And anything that is staged will be included in the experiment commit, so that it can be reproduced again later.

Feels like too much. Is it common to have something like venv/ as a stage dependency? Or are we expanding the scope here (looking ahead)? Thanks

I don't think it would be listed as a dependency, but some user stages may still depend on that directory existing. For example, I could have a pipeline stage with the command set to .venv/bin/python my_script.py

pmrowla avatar Apr 13 '21 05:04 pmrowla

@jdonzallaz For your particular use case, how important is it to have the credentials files as relative paths within your repo? Are there workarounds for you like using environment variables, absolute paths, or making paths relative to your home dir?

This isn't meant to dismiss the issue but to gather some more context about when and why it's important to have files like this accessible as gitignored files within the repo itself.

dberenbaum avatar Apr 13 '21 17:04 dberenbaum

pipeline stage with the command set to .venv/bin/python my_script.py

Any script used in cmd run should be listed under deps right? In this case pretty tricky as .venv/ is probably Git or DVC ignored, so would DVC even cache it? Interesting Q for sure, but it feels like a separate issue/discussion (we could open it though).

jorgeorpinel avatar Apr 14 '21 00:04 jorgeorpinel

Any script used in cmd run should be listed under deps right? In this case pretty tricky as .venv/ is probably Git or DVC ignored, so would DVC even cache it? Interesting Q for sure, but it feels like a separate issue/discussion (we could open it though).

Standard practice would only be to include my_script.py as a dep. If I have a stage that calls /usr/bin/bash my_script.sh, I don't add /usr/bin/bash as a pipeline dependency.

And yes, standard practice would also be for users to gitignore a venv directory, the same way that they should be gitignoring __pycache__. But in real world use cases, not everyone is gitignoring everything that they should be

pmrowla avatar Apr 14 '21 01:04 pmrowla

Ah, I see this issue is tricky. I've also checked the other related issues.

To answer your question @dberenbaum, there is no technical reason the credentials files could not be in another, absolute path. However feels like a hack and does not feel right to not have project-related files in the project folder, and we should add another config entry to tell where is the credential file.

At the moment, the only possibility is to git add -f / git remove. It works as a workaround but is not really convenient. I now understand why adding all files (even git-ignored) can be a problem because of folders like venv. I do not expect those to be copied though.

I feel that git-ignored files that are explicitly listed as stage dependencies (like my credentials files) should be copied to the temp folder (without relying on git). This either automatically or through a CLI option.

jdonzallaz avatar Apr 14 '21 09:04 jdonzallaz

I don't add /usr/bin/bash as a pipeline dependency.

That seems different. .venv/bin/python is a relative path inside the workspace.

in real world use cases, not everyone is gitignoring everything that they should

Agree but I'm not sure what you're suggesting @pmrowla, how do we even detect things that *should* be gitignored? I don't know the answer but again, it feels like a separate Q. I'd open another issue but TBH I don't know there's much DVC can do about that.

jorgeorpinel avatar Apr 14 '21 20:04 jorgeorpinel

At the moment, the only possibility is to git add -f / git remove.

@jdonzallaz again please be aware that the files will end up in the Git history this way, even if you don't explicitly git commit them.

I feel that git-ignored files that are explicitly listed as stage dependencies (like my credentials files) should be copied to the temp folder (without relying on git).

Yes! It's something we're considering (#5816): to have way(s) to tell DVC which (if any) untracked files to include in the queued exp, but knowing that DVC won't ever put them in Git. DVC would just cache them locally. And when/it you dvc exp apply that experiment, they would be restored to the workspace as untracked files again. WDYT?

jorgeorpinel avatar Apr 14 '21 20:04 jorgeorpinel

Agree but I'm not sure what you're suggesting @pmrowla, how do we even detect things that should be gitignored?

My point was that we can't do this, and that is a reason why we should not allow blanket adding untracked files. We should make the user be explicit.

pmrowla avatar Apr 15 '21 00:04 pmrowla

@jorgeorpinel Ah you're right, I forgot the part about the files ending in the Git history. What you propose seems the best solution.

Does having the untracked file in the stage dependencies is being explicit ? Otherwise, having an experiment config file or having CLI options look like good solutions to me.

jdonzallaz avatar Apr 15 '21 09:04 jdonzallaz

@jdonzallaz Are you collaborating with others on your project? Are these personal or shared credentials? I'm wondering how this would work on a team where everyone has personal credentials as dependencies, since it seems everyone will have to re-run each pipeline since the credentials will differ for everyone.

@pmrowla @jorgeorpinel Would this be solved by being able to cache data locally only (data gets ignored by dvc push and pull)? It seems like that would enable users to cache sensitive data. We have some other issues open related to fine-grained control of file-level permissions and remotes, so maybe this is a use case for those features.

dberenbaum avatar Apr 15 '21 14:04 dberenbaum

being able to cache data locally only (data gets ignored by dvc push and pull

Yep, I think that's the idea that's forming in https://github.com/iterative/dvc/discussions/5816#discussioncomment-612195.

jorgeorpinel avatar Apr 16 '21 04:04 jorgeorpinel

On second thought, I don't think a local cache would even be the right solution for credentials, because if they change, you probably want to use the updated version and not the cached version.

dberenbaum avatar Apr 16 '21 15:04 dberenbaum

I'm not collaborating with other on this project. We have another project with DVC and collaboration, but the credentials (for the remote) are in the config.local. In this project, the credentials are (can be) shared, as they are specific to the project (and not the user). But you're right that personal credentials will make the pipeline run again. I didn't think about this.

jdonzallaz avatar Apr 16 '21 16:04 jdonzallaz

Hi! Another idea from https://github.com/iterative/dvc/issues/1416#issuecomment-848215096 is to setup some mechanism (e.g. env vars) so that queued experiments can use certain assets from the original workspace directly (without moving them to the tmp dir or committing them to the exp commit).

jorgeorpinel avatar May 25 '21 20:05 jorgeorpinel

Hello, when running experiments in parallel, I am finding that local cache settings are ignored.

I am using dvc 2.9.2.

when I am in my repo:

(venv) [me@computer folder]$ dvc config -l
remote.remote.url=//remote/url
core.remote=remote
remote.remote.url=//local/remote/url
cache.dir=/cache/dir
cache.type=hardlink,symlink

but when I cd to one of the temporary experiment directories:

(venv) [me@computer tmp0zpspb9w]$ dvc config -l
remote.almds.url=//remote/url
core.remote=remote
cache.dir=/cache/dir

I mentioned this on Discord and was told that this could be relevant to this issue.

gregstarr avatar Mar 22 '22 01:03 gregstarr

I had a problem with running queued experiments (normal exp run worked fine): Screenshot 2022-08-06 at 17 18 22

As you can see the problem was with the google service account (strage).

The path to my service account json file was located in the config.local file as per my understanding of the instructions found here:

https://dvc.org/doc/user-guide/setup-google-drive-remote

e.g. Screenshot 2022-08-07 at 16 59 55

I fixed this issue by adding gdrive_service_account_json_file_path to the "global" config file. Not sure if this is the proper way to fix this issue, if I have somehow understood the instructions for setting up google service correctly etc. But might be worth some digging?

JohnTheDeere avatar Aug 07 '22 14:08 JohnTheDeere

@JohnTheDeere your problem is a known issue, see #7254

pmrowla avatar Aug 08 '22 00:08 pmrowla

@pmrowla Any thoughts on how to handle this? Could dvc copy over those contents that we know are part of the local dvc config?

dberenbaum avatar Aug 12 '22 20:08 dberenbaum

@dberenbaum we can pass the local config from the parent repo without copying files. We just have to implement passing the correct parent repo's config into the exp/tmpdir repo constructor. But DVC local config discussion should go in #7254. This issue is a separate problem (how to handle any generic gitignored file, and not DVC-specific configs).

Regarding general gitignored file handling, it's discussed further up in this ticket, but what we would need is a user-specified explicit list of files which they want copied into the temp workspace.

pmrowla avatar Aug 15 '22 03:08 pmrowla

Sorry, I did mean #7254, but since the discussion was here, I commented here. What do you think about prioritizing that one? Seems like that's a fairly substantial bug since it means you can't use dvc config --local with experiments, right?

dberenbaum avatar Aug 15 '22 21:08 dberenbaum

I think it's just a p2 given that the most common use case for config --local is remote auth options, which don't affect anything in experiment runs. The only config --local option I can think of that really matters for exp runs is overriding cache link type, but in exp run we explicitly use the parent repo's cache location and link types, so that use case is already covered.

#7254 will matter more if/when we get to running exps on remote machines, but for now I don't think it needs to be prioritized

pmrowla avatar Aug 16 '22 01:08 pmrowla

I guess for @JohnTheDeere, you are using external dependencies or outputs in your pipeline?

dberenbaum avatar Aug 16 '22 22:08 dberenbaum

Hey @dberenbaum, what kind of external dependencies/outputs are you referring to?

JohnTheDeere avatar Aug 18 '22 07:08 JohnTheDeere

I assumed that you have pipelines dependencies or outputs that are in gdrive instead of in your local workspace. Is that correct? Otherwise, I'm not sure why DVC would be failing to run your queued experiment. Could you share your dvc.yaml?

dberenbaum avatar Aug 19 '22 03:08 dberenbaum

Hi @dberenbaum, this is what my dvc.yaml looks like. In data_load I turn the raw data into PyTorch train/test data loaders. In the train stage I then train the model. Should be pretty straightforward?

stages:
  
  data_load:
    cmd: python3 -m src.data_handler.data_handler --config='params.yaml'
    deps:
      - src/data_handler/data_handler.py
      - data/processed/case_data.sav
    params:
      - data_load.test_size
      - data_load.batch_size
    outs:
      - data/train_data/dataset.pt
  
  train:
    cmd: python3 -m src.train.train --config='params.yaml'
    deps:
      - src/train/train.py
      - data/train_data/dataset.pt
    params:
      - train.learning_rate
      - train.epochs
      - train.temperature
    outs:
      - models/model.pt
    live:
      dvclive:
        summary: true
        html: true

JohnTheDeere avatar Aug 22 '22 07:08 JohnTheDeere