pipelines: parametrize using environment variables / DVC properties
It would be useful if I could parametrize my pipeline using environment variables, which could be read from a properties file specified using dvc config env my.properties. DVC would load those environment variables when running the command.
For example, I could have this properties file:
DVC_NICKNAME=David
And run:
dvc run -o hello.txt 'echo "Hello ${DVC_NICKNAME}!" > hello.txt'
dvc run -o cheers.txt 'echo "Cheers ${DVC_NICKNAME}!" > cheers.txt'
And produce "Hello David!" and "Cheers David!" files.
Users would just have to make sure to quote the command or use interactive mode #1415.
The DVC file would contain the variable reference:
cmd: echo "Hello ${DVC_NICKNAME}!" > hello.txt
The value would be added to the environment by DVC at startup, so it would be handled natively by the shell.
In order for dvc status to be able to detect that variables in a stage changed, we can calculate the internal md5 checksum on the contents with the variable values injected in place of the variable names, so that it would be handled as if the contents of the DVC file changed. This can be done using os.path.expandvars. Unfortunately, this would only replace variable references used directly in the shell command; it would not cover cases where you're using the environment variable inside a script. The only foolproof way would be to force the user to explicitly request the environment variables that should be injected from the properties file, e.g. using dvc run -e DVC_NICKNAME -e DVC_OTHER. That would basically allow adding additional "env dependencies" to stages.
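To illustrate the limitation (a sketch; greet.py is a hypothetical script, not part of the proposal):
# expandvars can rewrite this command before hashing, because the
# variable reference is visible in the cmd string itself:
cmd: echo "Hello ${DVC_NICKNAME}!" > hello.txt
# ...but not this one, where the hypothetical script reads the variable
# internally via os.environ["DVC_NICKNAME"], invisible to expandvars:
cmd: python greet.py > hello.txt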
It would be nice to inject the variables also into paths to dependencies, so that you can parametrize those as well. Could also be done using os.path.expandvars. This would change the DAG dynamically, but AFAIK it should actually magically work without breaking anything, right? As long as you just initialize the environment at each DVC startup and call expandvars when reading deps paths.
This one is interesting, @prihoda :thinking:
I've never encountered the use of variables in DVC commands, maybe my use cases are very simple :sweat_smile:!
This looks like Makefile behavior, where you can define variables at the top of your file and use them in the rules.
I prefer being "explicit rather than implicit" and I'll think twice before introducing this request. Let's leave this open and wait for some thumbs up or comments from other users :slightly_smiling_face:
If you need this right now, a workaround would be using something like direnv to define the variables right there, and just let the shell expand them.
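For example (a minimal sketch, assuming direnv is installed and allowed for the repo):
# .envrc at the repo root, loaded automatically by direnv when you cd in
export DVC_NICKNAME=David
$ dvc run -o hello.txt 'echo "Hello ${DVC_NICKNAME}!" > hello.txt'
Note that with this workaround DVC does not track the variable's value, so changing it will not invalidate the stage.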
@prihoda If going with plain env vars, I think a more natural approach would be to define something like a config file that you would specify as a dependency for your stage. E.g.
# env.sh
DVC_NICKNAME=David
and you would use it in your stage like so:
$ dvc run -d env.sh -o hello.txt 'source env.sh && echo "Hello ${DVC_NICKNAME}!" > hello.txt'
Though, it doesn't solve dynamic dep/out expansion. Maybe we could consider introducing a -e env.sh (or maybe env.yml, to make it more cross-platform) option, that would make dvc read it before expanding deps/outs/cmd paths. And it would make env.sh a simple direct dependency for the stage, which seems to suit the current dvc architecture nicely. I think we've discussed this briefly in https://github.com/iterative/dvc/issues/1119 . Need to take a closer look into this.
@prihoda could you please describe a "real-life" example where parametrizing pipelines like this would give benefits? Were you trying to solve some problem? (I do have some ideas of my own, I just want to know your thoughts on this)
@mroutis what do you mean by "explicit vs implicit" in this case? Maybe I'm missing something, but having a way of passing some parameters (in a way that DVC tracks changes, expands, etc.) can be done more or less explicitly - like an explicit config file with all these parameters.
@efiop creating a single config file and using it as a dependency (via -d) breaks granularity of caching - every change in this global config makes the whole pipeline outdated (because usually a lot of stages depend on different variables in this config).
@shcheklein It can be used in any pipeline when you're providing the same parameters in different stages. I was solving it by manually specifying the parameter multiple times and I didn't realize it could be solved using a custom config file provided as a dependency as suggested by @efiop.
The problem is that if the config properties were provided as environment variables, even a global DVC config file would have to break granularity of caching, since you could use those variables hidden inside bash scripts so there would be no way to check which variables are used.
[edited] So the only benefit would probably be if the variables could also be used in dependencies/outputs. For example, configuring the highest performing model file and using that throughout the pipeline. But not sure it's worth the effort - currently I'm solving it by just having a special location "models/top.pkl" where I copy it.
@mroutis what do you mean by "explicit vs implicit" in this case?
I was using "explicit" to refer to any commands using additional context from the environment (for example, variables).
However, I really like the ideas proposed:
- Having the variables as a dependency (either with the -e option or an env.sh file); this way, if the env changes, the stage is going to be reproduced (with the -e option we could even raise an error if the user doesn't have those variables in their environment)
I like the idea with -e as well. To be completely fair, I don't like that with DVC you have to specify (and keep them up to date) all the dependencies yourself, but I don't see any good implicit options.
I don't like that with DVC you have to specify (and keep them up to date) all the dependencies yourself
@shcheklein, have you seen any other solution that deals with dependencies implicitly?
Maybe we could watch the current directory for events triggered by the command's PID (implying that every "read" file is a dependency and "created" one an output), sadly, there's a lot of edge cases :disappointed: (process creating temp files, windows support, remote dependencies/outputs as HTTP or S3, etc.)
The only solution that I'm thinking about is "implicit" rules a la Makefile, but I don't think something similar could work for DVC.
By the way, I didn't understand quite well the implications of "breaking granularity of caching" by having a file with parameters, would you mind explaining?
By the way, I didn't understand quite well the implications of "breaking granularity of caching" by having a file with parameters, would you mind explaining?
Yep. Let's imagine you have a global env.sh file with two parameters A and B. And you have two stages - S1 and S2. S1 depends on (uses) A, S2 depends on B. We have to specify -d env.sh for both stages to capture these dependencies. The problem is that -d env.sh is not granular enough, in the sense that if you change A, dvc makes S2 stale along with S1 and we have to run it again. Basically, what I usually saw happening in this scenario (one global config) is that almost every stage depends on this single file, and every change to the file invalidates all intermediate results (cached data produced by some intermediate stages in the pipeline DAG). Hope all of this makes sense :)
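A minimal sketch of that scenario (s1.py and s2.py are hypothetical scripts standing in for the two stages):
# env.sh - both parameters live in one shared file
A=1
B=2
$ dvc run -d env.sh -d s1.py -o out1.txt 'source env.sh && python s1.py "$A" > out1.txt'
$ dvc run -d env.sh -d s2.py -o out2.txt 'source env.sh && python s2.py "$B" > out2.txt'
Editing A changes the checksum of env.sh, so both stages are treated as changed, even though only the first one actually reads A.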
@shcheklein, have you seen any other solution that deals with dependencies implicitly?
Make does not itself implicitly derive dependencies as far as I remember. You have to specify them (or use autotools or gcc to parse source and create a list with dependencies). CMake does this automatically. But I agree, I don't think this can work with DVC.
I don't know if there are other tools to be honest.
Maybe we could watch the current directory for events triggered by the command's PID (implying that every "read" file is a dependency and "created" one an output), sadly, there's a lot of edge cases 😞 (process creating temp files, windows support, remote dependencies/outputs as HTTP or S3, etc.)
Yep. This is fragile. I think @dmpetrov and @efiop tried this already a while ago.
Another user has expressed a use case for supporting env vars in stage/pipeline files. Context: https://discord.com/channels/485586884165107732/485596304961962003/715639271914209332
I want to create a dvc stage with code from a third-party python package that needs to be installed first. Since the path to the code of this source file might look different for different contributors, I wonder if it is even possible to track such a file as a dependency. I'm talking about an executable from a third-party python package that I would directly use as a source file in the stage. They just go in the corresponding Python installation bin folder and would then be on the PYTHONPATH afterwards. As an example: pip install <some_package> -> the executable then lies under ~/.local/bin or something. Then I want to use it as dvc run -d <path_to_the_executable_on_the_system>.py ... python <path_to_the_executable_on_the_system>.py ... The problem will be that <path_to_the_executable_on_the_system>.py will look different for other contributors of the project.
Another user expressed interest in this feature implicitly in https://discord.com/channels/485586884165107732/485596304961962003/765677497843974204:
what if I want to execute pipeline for another dataset, should I do it manually? change params.yaml and other stuff every time?
I think this would also solve another issue that I am experiencing. The situation is this:
- I run a dvc pipeline within a virtualenv in a local folder ./build/virtualenv.
- Within the pipeline I call rasa, an ML framework, with the path ./build/virtualenv/bin/rasa.
- When I queue up multiple experiments with dvc exp run ... --queue and then run them with dvc exp run --run-all, I get an error that ./build/virtualenv/bin/rasa is not found.
- If I hardcode the location of the executable to /home/ubuntu/project/build/virtualenv/bin/rasa then queuing experiments works.
- I don't want to have to hardcode this path because it is not the same across the team's machines, which is where env var expansion in dvc.yaml would be helpful.
- I would set RASA_PATH=/home/ubuntu/project/build/virtualenv/bin/rasa and then in my dvc.yaml set: cmd: ${RASA_PATH} train
Perhaps there is another way to resolve this issue with experiments, env var expansion would also do it.
When I queue up multiple experiments with dvc exp run ... --queue and then run them with dvc exp run --run-all I get an error that ./build/virtualenv/bin/rasa is not found.
@ivyleavedtoadflax this is because queued experiments run in a temporary folder outside your workspace. See https://dvc.org/doc/command-reference/exp/run#queueing-and-parallel-execution
Just FYI. It's a good question but maybe there's something planned for dvc exp that will better address that scenario? Cc @pmrowla
p.s. actually there's already a workaround/trick (see https://github.com/iterative/dvc/issues/5800#issuecomment-818389723):
$ git add -f ./build/virtualenv
$ dvc exp run ... --queue
...
$ git reset
$ dvc exp run --run-all
Anything Git staged at the time of queueing experiments (no commit necessary) will be included in that exp's temp dir.
⚠️ But if/when you later exp apply one of those experiments, build/virtualenv/ would end up in the Git repo.
thanks @jorgeorpinel, I was hoping you might say something like that :clap:
Ahh, realising this is rather slow when the virtualenv is 1.6GB and you queue up a whole load of experiments :facepalm:
Yeah that too 🙁
I think the original issue is about tracking the expanded environment variable value as a dependency, so that /home/ubuntu/project/build/virtualenv/bin/rasa becomes a dependency, and if another user sets RASA_PATH=/Users/user/project/build/virtualenv/bin/rasa, then it will get recorded as a dependency change. This seems like the opposite behavior from what's desired by @ivyleavedtoadflax, where users may have different paths to their virtualenv, but those shouldn't be tracked as dependency changes.
* I would set `RASA_PATH=/home/ubuntu/project/build/virtualenv/bin/rasa` and the in my `dvc.yaml` set: `cmd: ${RASA_PATH} train`
You might need to escape the $ to cmd: \${RASA_PATH} train, but I think this should work otherwise. You could alternatively do PATH=/home/ubuntu/project/build/virtualenv/bin:$PATH, and set cmd: rasa train. Does that work?
You might need to escape the $ to cmd: \${RASA_PATH} train, but I think this should work otherwise.
Yes! This does work. Thanks!
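For reference, the working stage would look roughly like this in dvc.yaml (a sketch; the stage name, deps, and outs are illustrative). The backslash stops DVC's own templating from trying to resolve ${RASA_PATH} itself and passes the literal reference through to the shell, which expands it from the environment:
stages:
  train:
    cmd: \${RASA_PATH} train
    deps:
      - data
    outs:
      - models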
Possibly another application for this suggested in https://github.com/iterative/dvc/discussions/6113#discussioncomment-823935
I'm just starting with DVC and there may be more correct ways to do what I initially came up with, but since I couldn't find anything in the documentation or forums, this is what I did. Context: a git repo (hosted by Bitbucket) with DVC (tracked files in an S3 bucket under a project-specific directory).
- we already have .env files for staging and production
- our code uses variables defined in the .env files to run, build, deploy, log, etc
- we use Bitbucket pipelines, but most of the work is done by our bash scripts
- because we're using AWS we have on our developer machines $HOME/.aws/config and $HOME/.aws/credentials
- these credentials are also in the .env files, but because they are for deployments they have names like DEPLOYMENT_AWS_ACCESS_KEY_ID, while AWS_ACCESS_KEY_ID is for runtime on EC2.
- we have multiple developers, some work on machine learning models (pytorch) and others do processing code and devops (myself), and some do machine learning and processing.
Given the above, I have written a script for deployment that assumes that the .env file has been parsed and all of the definitions are available in the current script environment. It takes the "developer centric" DVC configuration for our remote storage and converts it to use the script environment variables. I didn't see anything that explained a better way to do this and came up with this workaround to provide the credentials through environment variables:
set +u
if [[ -n "${DEPLOYMENT_AWS_ACCESS_KEY_ID}" ]]; then
export AWS_ACCESS_KEY_ID="${DEPLOYMENT_AWS_ACCESS_KEY_ID}"
else
echo "Warning: DEPLOYMENT_AWS_ACCESS_KEY_ID is not defined. Using AWS_ACCESS_KEY_ID" >&2
fi
if [[ -n "${DEPLOYMENT_AWS_SECRET_ACCESS_KEY}" ]]; then
export AWS_SECRET_ACCESS_KEY="${DEPLOYMENT_AWS_SECRET_ACCESS_KEY}"
else
echo "Warning: DEPLOYMENT_AWS_SECRET_ACCESS_KEY is not defined. Using AWS_SECRET_ACCESS_KEY" >&2
fi
if [[ -n "${DEPLOYMENT_AWS_DEFAULT_REGION}" ]]; then
export AWS_DEFAULT_REGION="${DEPLOYMENT_AWS_DEFAULT_REGION}"
else
echo "Warning: DEPLOYMENT_AWS_DEFAULT_REGION is not defined. Using AWS_DEFAULT_REGION" >&2
fi
set -u
REMOTE_STORAGE_PROFILE=""
REMOTE_STORAGE_CREDENTIALPATH=""
# remove the project remote.storage.profile and the local remote.storage.credentialpath
# and use the environment variables instead; this is likely only needed on a development machine
set +e
REMOTE_STORAGE_PROFILE="$(dvc config --project remote.storage.profile)"
REMOTE_STORAGE_CREDENTIALPATH="$(dvc config --local remote.storage.credentialpath)"
dvc config --project --unset remote.storage.profile
dvc config --local --unset remote.storage.credentialpath
echo "REMOTE_STORAGE_PROFILE = ${REMOTE_STORAGE_PROFILE}"
echo "REMOTE_STORAGE_CREDENTIALPATH = ${REMOTE_STORAGE_CREDENTIALPATH}"
set -e
dvc pull --verbose
if [[ -n "${REMOTE_STORAGE_PROFILE}" ]]; then
# restore the value for remote.storage.profile if it was set before
dvc config --project remote.storage.profile "${REMOTE_STORAGE_PROFILE}"
fi
if [[ -n "${REMOTE_STORAGE_CREDENTIALPATH}" ]]; then
# restore the value for remote.storage.credentialpath if it was set before
dvc config --local remote.storage.credentialpath "${REMOTE_STORAGE_CREDENTIALPATH}"
fi
If I could have used $HOME in my .dvc/config I could have used --project configuration everywhere. As it is each developer will need to run dvc config --local remote.storage.credentialpath "$HOME/.aws/credentials" in their working copy of the repository. I could have also created a $HOME/.aws/credentials file with the correct content in the bitbucket environment.
Instead I kind of aimed for the middle of the road: thinking that I could define the DVC remote.storage.url, remote.storage.profile, and remote.storage.credentialpath in a cross-developer way, I started down that path, but then had to remove remote.storage.profile and remote.storage.credentialpath from the DVC configuration when building on Bitbucket.
We run dvc pipelines inside of gitlab pipelines, and this feature would be extraordinarily helpful for gathering information on what branch the pipeline is running on, etc, and making additional commits after processing.
@stephanrb3 How would you use the feature to gather info on the branch or make additional commits? Do you have some example you could share?
We run DVC pipelines inside of bitbucket pipelines, and some of our pipelines require authenticated access to certain resources (e.g. databases, AWS tokens, etc.).
Ideally, we would like to provide these as environment variables, but it seems like this isn't possible at the moment.
We run DVC pipelines inside of bitbucket pipelines, and some of our pipelines require authenticated access to certain resources (e.g. databases, AWS tokens, etc.).
Ideally, we would like to provide these as environment variables, but it seems like this isn't possible at the moment.
Hi, @nishanthmerwin
At least S3, Azure, GCS, GDrive and maybe some other clouds all support environment variables as credentials.
https://dvc.org/doc/command-reference/remote/modify#authenticate-with-an-azure-config-file
So which cloud are you using?
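For S3, for instance, the standard AWS environment variables are picked up without any DVC-specific configuration (a sketch; the remote name storage is illustrative):
$ export AWS_ACCESS_KEY_ID=...
$ export AWS_SECRET_ACCESS_KEY=...
$ dvc pull -r storage
So you can set them as protected CI variables in Bitbucket and avoid putting credentials in the repo.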
Hi @dberenbaum,
I'd like to join the conversation and mention that our use case is very similar to the one described by @stephanrb3. We run DVC pipelines within GitLab CI/CD pipelines and we've found that having a dependency on environment variables would be very helpful. Specifically, we tag our images with branch names and would like to re-run the DVC stage that builds the image every time the branch name changes.
@sukhovvl Have you tried something like dvc exp run --set-param branch=$CI_COMMIT_BRANCH? Am I getting the idea right on what you are trying to do?
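A sketch of that idea (the stage and docker tag are illustrative): keep the branch in params.yaml, declare it as a params dependency so the stage re-runs when it changes, and override it from CI.
# params.yaml
branch: main

# dvc.yaml
stages:
  build_image:
    cmd: docker build -t my-image:${branch} .
    params:
      - branch

# in CI, override the param with the current branch:
$ dvc exp run --set-param branch=$CI_COMMIT_BRANCH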
Yes, that's right. There are many workarounds; for example, what we currently do is add a dedicated stage:
stage1:
cmd:
- git symbolic-ref --short HEAD > branch.name
outs:
- branch.name
stage2:
deps:
- branch.name
But from the pure UX perspective dependency on an env variable seems like a nice feature to have.