`dvc import` compatible with GitHub App Token

Open mikolajpabiszczak opened this issue 3 years ago • 10 comments

I haven't seen any proposal of this kind in the issues and - based on my use case - it could solve a number of problems.

Scenario:

  • you have a Data Registry (as git repo + cloud storage, e.g., AWS S3);
  • you have an Experiment Repository containing the code that runs experiments (and the experiments use data from the Data Registry);
  • you wrap this with CML and use a GitHub App with access tokens.

Problem:

  • suppose you use dvc import to obtain some_data from the Data Registry (call it: github.com/username/DataRegistry)
  • it will be recorded in dvc.lock as
     deps:
       - path: some_data
         repo:
           url: git@github.com:username/DataRegistry.git
           rev_lock: af6a1feb542dc05b4d3e9c80deb50e6596876e5f
    
  • now the problem occurs: CML runs this pipeline on an instance, and when it tries to get the data from the Data Registry remote it fails, because it cannot clone the Data Registry repository (to do so, it would need to use the generated app token).

Proposition:

  • it would be nice if dvc import (or actually dvc pull?) checked for a DATA_REGISTRY_TOKEN environment variable and updated the url "on the fly" when pulling data from the remote.
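
As a rough sketch of the "on the fly" rewrite proposed above, the SSH URL recorded in dvc.lock could be mapped to a token-authenticated HTTPS URL. This is not existing DVC behaviour: `to_token_url` is a hypothetical helper written for illustration, `DATA_REGISTRY_TOKEN` is only the variable name suggested in this proposal, and `x-access-token` is the username GitHub documents for installation tokens over HTTPS.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): map an SSH-style GitHub URL to an
# HTTPS URL that carries a token.
to_token_url() {
    ssh_url="$1"
    token="$2"
    case "$ssh_url" in
        git@github.com:*)
            # Drop the "git@github.com:" prefix, keeping "owner/repo.git".
            repo_path="${ssh_url#git@github.com:}"
            printf 'https://x-access-token:%s@github.com/%s\n' "$token" "$repo_path"
            ;;
        *)
            # Leave any other URL unchanged.
            printf '%s\n' "$ssh_url"
            ;;
    esac
}

# Example call; falls back to a placeholder when the variable is unset.
to_token_url "git@github.com:username/DataRegistry.git" "${DATA_REGISTRY_TOKEN:-example-token}"
```

The rewritten URL would only be used for the clone; dvc.lock itself would keep the original url, so the token never ends up in the repository.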

Disclaimer: I intended to write this up some months ago, and at that time the desired behaviour was not in place. I took a quick look now, but did not find any mention of it.

Thanks for your effort and please ask any questions in case you need clarification!

mikolajpabiszczak avatar Jul 29 '22 09:07 mikolajpabiszczak

@casperdcl FYI. Any thoughts on this scenario?

dberenbaum avatar Sep 01 '22 16:09 dberenbaum

I'm not sure I follow. Is the issue about authentication for dvc in CI using env vars? That's already supported (via https://dvc.org/doc/command-reference/remote/modify#available-parameters-per-storage-type), e.g. AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY.

Or do you mean DVC's deps.*.repo.url is a private repo that needs a PAT for pull access? In which case I guess DVC could support a REPO_TOKEN env var for authentication the same way CML does. Plus it would need a CLI API for it - presumably dvc import --token=... though not sure where it should store said token. Presumably not in dvc.yaml but in the system config? Would mean treating the repo URL like a data remote URL (i.e. give it a shortname, save creds in user config dirs, etc.)

casperdcl avatar Sep 05 '22 15:09 casperdcl

@casperdcl This one:

> Or do you mean DVC's deps.*.repo.url is a private repo that needs a PAT for pull access? In which case I guess DVC could support a REPO_TOKEN env var for authentication the same way CML does. Plus it would need a CLI API for it - presumably dvc import --token=... though not sure where it should store said token. Presumably not in dvc.yaml but in the system config? Would mean treating the repo URL like a data remote URL (i.e. give it a shortname, save creds in user config dirs, etc.)

Although I believe the PAT / App Token should not be stored, since (in the case of the App Token) it is re-generated every time the pipeline runs (e.g., in a GitHub Action). One idea for a solution could be an --import-token option that works with other dvc commands (e.g., dvc repro): when passed, anything that was obtained with dvc import would use that token to authenticate when checking out the repo under the url key.

mikolajpabiszczak avatar Sep 06 '22 11:09 mikolajpabiszczak

@dtrifiro Any idea how this should work after dulwich upgrades?

dberenbaum avatar Sep 07 '22 18:09 dberenbaum

@dberenbaum

If you're thinking of support for git credential helpers, one way this could work is the following:

  1. Set up a credential helper (could even be git credential-cache, if cli git is available)
  2. Store the credential in the helper
  3. Actually perform the operation.

For example:

printf '[credential]\n    helper=cache\n' >> ~/.gitconfig
printf "url=https://github.com\nusername=username\npassword=password\n" | git credential-cache store
dvc import https://github.com//[...]

This looks a bit clunky to me, although this would work starting with the next dvc release (see https://github.com/iterative/scmrepo/pull/138).
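
For CI, where the token is regenerated on every run, a variant of the steps above avoids storing the token at all: git accepts a shell snippet as a credential helper, and that snippet can read the token from the environment when credentials are requested. The following is only a sketch. `use_env_token_helper` is a hypothetical function name, `DATA_REGISTRY_TOKEN` is the variable name proposed earlier in this thread, and it is an assumption that dvc's credential helper support treats shell-snippet helpers the same way the git cli does.

```shell
#!/bin/sh
# Register a credential helper for github.com that answers "get" requests
# with the token taken from the environment. Nothing secret is written to
# disk: the config only stores the snippet, not the token value.
# Assumption: DATA_REGISTRY_TOKEN holds a GitHub App installation token.
use_env_token_helper() {
    git config --global credential.https://github.com.helper \
        '!f() { test "$1" = get && echo "username=x-access-token" && echo "password=${DATA_REGISTRY_TOKEN}"; }; f'
}
```

After calling `use_env_token_helper`, running `printf 'protocol=https\nhost=github.com\n\n' | git credential fill` shows what git would send for github.com, which is a quick way to check the setup before running dvc import.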

An alternative would be setting up credentials sections in the dvc config that can be looked up when performing import or import-url, something like:

['credential "https://github.com"']
username = username
password = password

It might also be worth providing facilities to write values to the config, something like:

dvc config set credential.https://github.com username username       
dvc config set credential.https://github.com password password       

Cons with this approach:

  • configuring git credentials in the dvc config seems a bit out of place
  • possibly duplicating functionality provided by git (see man gitcredentials)
  • storing passwords in config files (although this could be similar to storing remote credentials in --local config)

dtrifiro avatar Sep 12 '22 13:09 dtrifiro

Hm, in this case where there is an import from a data registry repo, can the token work over SSH, or would we need to convert to HTTP?

dberenbaum avatar Sep 16 '22 20:09 dberenbaum

A similar report from a user who wants to dvc import from a private repo inside their CI environment: https://discord.com/channels/485586884165107732/485596304961962003/1057317845744238644.

dberenbaum avatar Dec 30 '22 18:12 dberenbaum

Hey, any update on a feature to import from a private repository without using a git SSH key?

moisesrc13 avatar Oct 18 '23 22:10 moisesrc13

@moisesrc13 The credential helper support mentioned above is now implemented, so you should be able to use that and authenticate to a private repo in the same ways you can using the git cli.

dberenbaum avatar Oct 19 '23 20:10 dberenbaum

Thanks. Will give it a try.

moisesrc13 avatar Oct 20 '23 03:10 moisesrc13