`dvc import` compatible with GitHub App Token
I haven't seen any proposal of this kind in the issues and - based on my use case - it could solve a number of problems.
Scenario:
- you have a Data Registry (as git repo + cloud storage, e.g., AWS S3);
- you have an Experiment Repository containing the code that runs experiments (and the experiments use data from the Data Registry);
- you wrap all of this with CML and use a GitHub App with Access Tokens.
Problem:
- suppose you use `dvc import` to obtain `some_data` from the Data Registry (call it: `github.com/username/DataRegistry`) - it will be recorded in `dvc.lock` as

      deps:
      - path: some_data
        repo:
          url: git@github.com:username/DataRegistry.git
          rev_lock: af6a1feb542dc05b4d3e9c80deb50e6596876e5f

- now the problem occurs: CML runs this pipeline on an instance, and when it tries to get the data from the Data Registry remote it fails, because it cannot clone the Data Registry repository (to do so, it would need to use the generated app token).
Proposition:
- it would be nice if `dvc import` (or actually `dvc pull`?) checked for a `DATA_REGISTRY_TOKEN` env variable and updated the url "on the fly" when pulling data from the remote.
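To make the proposition concrete, here is a minimal Python sketch of what "updating the url on the fly" could mean. It is not DVC code: the function name `with_token` is hypothetical, `DATA_REGISTRY_TOKEN` is the variable name suggested above, and it assumes the token is a GitHub App installation token, which GitHub accepts over HTTPS with the username `x-access-token`.

```python
import os
import re
from urllib.parse import quote

# Matches scp-like SSH URLs such as git@github.com:owner/repo.git
_SCP_LIKE = re.compile(r"^git@(?P<host>[^:/]+):(?P<path>.+)$")


def with_token(url: str, env_var: str = "DATA_REGISTRY_TOKEN") -> str:
    """Return an HTTPS URL carrying the token from env_var, if one is set.

    Hypothetical helper: DVC does not expose anything with this name.
    """
    token = os.environ.get(env_var)
    if not token:
        return url  # no token available, leave the recorded URL untouched

    match = _SCP_LIKE.match(url)
    if match:
        # SSH form recorded by `dvc import`: switch to HTTPS
        host, path = match["host"], match["path"]
    elif url.startswith("https://"):
        host, _, path = url[len("https://"):].partition("/")
        host = host.rsplit("@", 1)[-1]  # drop any userinfo already present
    else:
        return url  # unknown scheme, do not guess

    # "x-access-token" is the username GitHub expects for installation tokens
    return f"https://x-access-token:{quote(token, safe='')}@{host}/{path}"
```

The rewritten URL would only be used for the clone and never written back to `dvc.lock`, so the token stays out of the repository.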
Disclaimer: I intended to write this up some months ago, and at that time the desired behaviour was not in place. I took a quick look but did not find any mention of it.
Thanks for your effort and please ask any questions in case you need clarification!
@casperdcl FYI. Any thoughts on this scenario?
I'm not sure I follow. Is the issue about authentication for dvc in CI using env vars? That's already supported (via https://dvc.org/doc/command-reference/remote/modify#available-parameters-per-storage-type) e.g. AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY.
Or do you mean DVC's deps.*.repo.url is a private repo that needs a PAT for pull access? In which case I guess DVC could support a REPO_TOKEN env var for authentication the same way CML does. Plus it would need a CLI API for it - presumably dvc import --token=... though not sure where it should store said token. Presumably not in dvc.yaml but in the system config? Would mean treating the repo URL like a data remote URL (i.e. give it a shortname, save creds in user config dirs, etc.)
@casperdcl This one:
> Or do you mean DVC's `deps.*.repo.url` is a private repo that needs a PAT for pull access? In which case I guess DVC could support a `REPO_TOKEN` env var for authentication the same way CML does. Plus it would need a CLI API for it - presumably `dvc import --token=...` though not sure where it should store said token. Presumably not in `dvc.yaml` but in the system config? Would mean treating the repo URL like a data remote URL (i.e. give it a shortname, save creds in user config dirs, etc.)
Although I believe the PAT / App Token should not be stored, since (in case of the App Token) it will be re-generated every time the pipeline is run (e.g., in GitHub action). One idea for a solution could be to have --import-token that would work with other dvc commands (e.g., dvc repro), which - when passed - would make sure that anything that was obtained with dvc import would use the passed token to authenticate when checking out the repo under url key.
@dtrifiro Any idea how this should work after dulwich upgrades?
@dberenbaum
If you're thinking of support for git credential helpers, one way this could work is the following
- Set up a credential helper (could even be `git credential-cache`, if cli git is available)
- Store the credential in the helper
- Actually perform the operation.
For example:
printf '[credential]\n\thelper = cache\n' >> ~/.gitconfig
printf "url=https://github.com\nusername=username\npassword=password\n" | git credential-cache store
dvc import https://github.com//[...]
This looks a bit clunky to me, although this would work starting with the next dvc release (see https://github.com/iterative/scmrepo/pull/138).
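For a CI job, the three steps above could be wrapped in a small Python script. The following is only a sketch under assumptions: the function names are made up for illustration, cli git must be installed, and it uses `git credential approve` (which passes the credential to whichever helper is configured) rather than calling `git credential-cache store` directly.

```python
import subprocess


def credential_payload(host: str, username: str, password: str) -> str:
    """Build the key=value input format described in `man git-credential`."""
    fields = {
        "protocol": "https",
        "host": host,
        "username": username,
        "password": password,
    }
    # One attribute per line, terminated by a blank line
    return "".join(f"{key}={value}\n" for key, value in fields.items()) + "\n"


def store_in_git_cache(host: str, username: str, password: str) -> None:
    """Configure the in-memory cache helper and hand it one credential."""
    # Step 1: set up the helper
    subprocess.run(
        ["git", "config", "--global", "credential.helper", "cache"],
        check=True,
    )
    # Step 2: store the credential in the configured helper
    subprocess.run(
        ["git", "credential", "approve"],
        input=credential_payload(host, username, password),
        text=True,
        check=True,
    )


# Step 3 (not run here): call store_in_git_cache(...) with the freshly
# generated token, then run `dvc import` / `dvc pull` as usual.
```

Because `git credential-cache` keeps credentials in memory only for a limited time, nothing is written to disk, which fits the case where an App Token is regenerated on every pipeline run.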
An alternative would be setting up credentials sections in the dvc config that can be looked up when performing import or import-url, something like:
['credential "https://github.com"']
username = username
password = password
It might also be worth providing facilities to write values to the config, something like
dvc config set credential.https://github.com username username
dvc config set credential.https://github.com password password
Cons with this approach:
- configuring git credentials in the dvc config seems a bit out of place
- possibly duplicating functionality provided by git (see `man gitcredentials`)
- storing passwords in config files (although this could be similar to storing remote credentials in `--local` config)
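As a rough illustration of the config-based alternative, the lookup at import time could work like the sketch below. It is only a sketch under assumptions: the `credential` section does not exist in DVC, `find_credentials` is a hypothetical name, and the stdlib `configparser` stands in for however DVC actually parses its config.

```python
import configparser
from typing import Optional, Tuple

# The section layout proposed above (hypothetical, not a DVC feature)
SAMPLE_CONFIG = """
['credential "https://github.com"']
username = username
password = password
"""


def find_credentials(config_text: str, url: str) -> Optional[Tuple[str, str]]:
    """Return (username, password) from the longest matching URL prefix."""
    # interpolation=None so that '%' inside a password is taken literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)

    best = None  # (prefix, username, password)
    for section in parser.sections():
        # Section names look like: 'credential "https://github.com"'
        kind, _, prefix = section.strip("'").partition(" ")
        prefix = prefix.strip('"')
        if kind != "credential" or not prefix or not url.startswith(prefix):
            continue
        if best is None or len(prefix) > len(best[0]):
            best = (
                prefix,
                parser[section].get("username", ""),
                parser[section].get("password", ""),
            )
    return (best[1], best[2]) if best else None
```

Picking the longest matching prefix lets a repository-specific section override a host-wide one, similar in spirit to how git matches `credential.<url>` settings.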
Hm, in this case where there is an import from a data registry repo, can the token work over SSH, or would we need to convert to HTTP?
A similar report from a user who wants to dvc import from a private repo inside their CI environment: https://discord.com/channels/485586884165107732/485596304961962003/1057317845744238644.
hey, any update on having a new feature to import from private repository without using git ssh key?
@moisesrc13 The credential helper support mentioned above is now implemented, so you should be able to use that and authenticate to a private repo in the same ways you can using the git cli.
Thanks. Will give it a try.