git tracked bare dvc repo (only tracking .dvc file, but don't checkout real file)

Open allenyllee opened this issue 3 years ago • 3 comments

Background

We generate a large number of log files every day, and we want to use dvc to track them.

Current Method

To track our daily logs with dvc today, we have to:

  1. Create a git repo and run dvc init
  2. Copy the log files into the git repo
  3. dvc add those files and dvc commit to generate the .dvc files, then dvc push to transfer the files to the remote
  4. git commit the generated .dvc files, and git tag to add a timestamp (or version)
  5. To save local space, remove all the log files and leave only the .dvc files

When new daily logs arrive, we need to repeat steps 2-5 to track them.

When someone needs to analyse the log files, they need to clone the git repo, git checkout a tagged version, and dvc checkout to download the files locally.
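The consumer side described above can be sketched as a small command builder. This is only an illustration: `consumer_commands`, `repo_url`, `tag`, and `dest` are hypothetical names, and the sketch uses `dvc pull` for the last step because `dvc pull` fetches from the configured remote before checking files out into the workspace.

```python
import subprocess


def consumer_commands(repo_url: str, tag: str, dest: str) -> list[list[str]]:
    """Build the commands an analyst would run to get one tagged snapshot.

    The first command runs anywhere; the remaining ones are meant to be
    run from inside `dest` (the freshly cloned repo).
    """
    return [
        ["git", "clone", repo_url, dest],  # clone the repo holding the .dvc files
        ["git", "checkout", tag],          # select the tagged data version
        ["dvc", "pull"],                   # download the data behind the .dvc files
    ]


def run_all(commands: list[list[str]], dest: str) -> None:
    """Run the clone in the current directory and everything else in `dest`."""
    subprocess.run(commands[0], check=True)
    for cmd in commands[1:]:
        subprocess.run(cmd, check=True, cwd=dest)
```

Calling `run_all(consumer_commands(url, "some-tag", "logs-repo"), "logs-repo")` would execute the steps; the builder alone can be used as a dry run.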

Proposed Method

  1. Provide a single dvc command (something like dvc init --bare --remote, or a Python API) to create a "git tracked bare dvc repo" on a remote machine.
  2. Provide a single dvc command (something like dvc push --transfer --remote, or a Python API) to transfer daily logs directly to the remote. This command has a --tag option and performs steps 2-5 above on the remote machine.
  3. When daily logs arrive, just do step 2 to transfer the files with a version tag (no need to copy them into a local git repo).
  4. When someone needs to analyse the log files, they can clone the "git tracked bare dvc repo" with only .dvc files, git checkout a tagged version, and dvc checkout to download the files.

Further, because the "git tracked bare dvc repo" should only be modified by the data owner, other users cannot push their code to the "git tracked bare dvc repo" remote. Instead, they create a new git repo and add the "git tracked bare dvc repo" as another git remote. In the git graph, they can see two parallel lines: one for our data repo, one for their code repo.

They can cherry-pick a commit from the data repo, move the .dvc file into another folder, and then run dvc checkout; the file will be pulled from our data repo and downloaded into their folder. Then they can start writing their code and commit it to their own git remote.

Sum up

We can treat the "git tracked bare dvc repo" as a combination of a git bare repo and a dvc cache: a whole structure used only for tracking data blobs. It can be seen as a regular git remote or imported as a git submodule, but it can only be modified by the data owner. Developers just include it, pull the data, do their experiments, and push to their own repo without touching the data repo.

Also, if you don't use git, you can still treat it as a regular dvc cache remote. But with git, you have the full power of git!

Advanced

If you have multiple data sources and want to share a single data repo, the command in proposed step 2 could take a --source option and create a git branch named after the given source. This newly created branch is parallel to the other source branches (with no common commit). From the developer's point of view, there are many parallel branches in the data repo, and they just need to pick a branch (a data source) to merge into their local working branch.

If the data owner needs to merge two data sources into one, it can be as easy as using git merge in the data repo to merge the two parallel data source branches into one branch!
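With plain git, parallel branches that share no commit can already be created as orphan branches, and merging them requires explicitly allowing unrelated histories. A minimal sketch, assuming Git 2.23 or newer for `git switch`; the function names and the branch names used with them are hypothetical:

```python
def new_source_branch(source: str) -> list[list[str]]:
    """Commands to start a branch for a data source with no parent commit."""
    return [
        # --orphan creates a branch whose first commit will have no history
        ["git", "switch", "--orphan", source],
    ]


def merge_sources(target: str, other: str) -> list[list[str]]:
    """Commands to merge one parallel source branch into another."""
    return [
        ["git", "switch", target],
        # git refuses to merge branches without a common ancestor
        # unless this flag is given
        ["git", "merge", "--allow-unrelated-histories", other],
    ]
```

Because each source branch only holds small .dvc files, such a merge touches metadata only; the data itself stays in the dvc remote.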

allenyllee avatar Aug 05 '22 05:08 allenyllee

@allenyllee

When someone needs to analyse the log files, they need to clone the git repo, git checkout a tagged version, and dvc checkout to download the files locally.

Have you thought about using dvc get or dvc import to obtain the data from your log repository?

Provide a single dvc command (something like dvc push --transfer --remote, or a Python API) to transfer daily logs directly to the remote. This command has a --tag option and performs steps 2-5 above on the remote machine.

That looks like a rather specific use case, which could be achieved with a parametrized script. The steps are repetitive, so it seems to me that a bash/python script parametrized with file_name, in_repo_name, and tag would cover it. Have you tried this approach?
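Such a script might look like the sketch below. It only builds the commands for steps 2-5 so they can be inspected or run from the repo root; `publish_commands` and `run_all` are hypothetical names, `cp` and `rm` assume a POSIX shell environment, and a dvc remote is assumed to be configured already.

```python
import subprocess
from pathlib import PurePosixPath


def publish_commands(file_name: str, in_repo_name: str, tag: str) -> list[list[str]]:
    """Build the commands for steps 2-5, to be run from the repo root.

    file_name:    path of the new log file outside the repo
    in_repo_name: path the file should get inside the repo
    tag:          git tag (timestamp or version) for this snapshot
    """
    dvc_file = in_repo_name + ".dvc"
    # dvc add writes a .gitignore entry next to the tracked file
    gitignore = str(PurePosixPath(in_repo_name).parent / ".gitignore")
    return [
        ["cp", file_name, in_repo_name],            # step 2: copy into the repo
        ["dvc", "add", in_repo_name],               # step 3: creates <name>.dvc
        ["dvc", "push"],                            # step 3: upload to the remote
        ["git", "add", dvc_file, gitignore],        # step 4: stage the metadata
        ["git", "commit", "-m", f"Track {in_repo_name} ({tag})"],
        ["git", "tag", tag],                        # step 4: version the snapshot
        ["rm", in_repo_name],                       # step 5: drop the workspace copy
    ]


def run_all(commands: list[list[str]], repo_dir: str) -> None:
    """Execute the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True, cwd=repo_dir)
```

The sketch leaves out a separate `dvc commit` because `dvc add` already writes the .dvc file. Note that removing the workspace file does not clear the local dvc cache, so reclaiming that space would need an extra step.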

When daily logs arrive, just do step 2 to transfer the files with a version tag (no need to copy them into a local git repo).

It seems like dvc add --to-remote would be helpful here.
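As a rough illustration of how `dvc add --to-remote` could shorten the workflow, the sketch below transfers a file straight to remote storage and then versions only the resulting .dvc file. The function name and parameters are hypothetical, and the exact behaviour of `-o` and `-r` together with `--to-remote` should be checked against the installed DVC version.

```python
from typing import Optional


def to_remote_commands(
    source: str, out: str, tag: str, remote: Optional[str] = None
) -> list[list[str]]:
    """Build commands that send `source` to the dvc remote without a local copy.

    source: path or URL of the log file to transfer
    out:    name the data should have in the project (assumed to be a plain
            file name in the repo root for this sketch)
    tag:    git tag for this snapshot
    remote: optional name of a non-default dvc remote
    """
    add = ["dvc", "add", source, "--to-remote", "-o", out]
    if remote is not None:
        add += ["-r", remote]
    return [
        add,                                   # transfer data, write <out>.dvc
        ["git", "add", out + ".dvc"],          # version only the metadata
        ["git", "commit", "-m", f"Track {out} ({tag})"],
        ["git", "tag", tag],
    ]
```

Compared with the five-step method, this avoids the copy into the repo and the later cleanup, since the data is not kept in the workspace.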

When someone needs to analyse the log files, they can clone the "git tracked bare dvc repo" with only .dvc files, git checkout a tagged version, and dvc checkout to download the files.

Depending on how you use these logs later down the road, it seems like import or get will be helpful.

Further, because the "git tracked bare dvc repo" should only be modified by the data owner, other users cannot push their code to the "git tracked bare dvc repo" remote. Instead, they create a new git repo and add the "git tracked bare dvc repo" as another git remote. In the git graph, they can see two parallel lines: one for our data repo, one for their code repo.

I am not sure I understand DVC's responsibility here. This sounds to me like a git remote with a restricted main branch, where users can create their own branches so that they show up in git log. Am I misunderstanding something?

Also, if you don't use git, you can still treat it as a regular dvc cache remote. But with git, you have the full power of git!

If you use dvc import or dvc get, you can specify which revision to source your data from, which lets you control the version of the data used in personal experiments :)
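For example, both commands accept a `--rev` option that takes a git tag, branch, or commit from the source repository. A minimal sketch; the helper name and the example values are hypothetical:

```python
def fetch_command(repo_url: str, path: str, rev: str, track: bool = False) -> list[str]:
    """Build a dvc command that fetches `path` from a data repo at `rev`.

    track=False uses `dvc get`, a one-off download with no link to the source.
    track=True uses `dvc import`, which also writes a .dvc file recording the
    source repo and revision, and must be run inside a dvc project.
    """
    subcommand = "import" if track else "get"
    return ["dvc", subcommand, repo_url, path, "--rev", rev]
```

With this approach the analyst does not need to clone the log repository at all; the tag passed as `rev` plays the role of the git checkout step.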

pared avatar Aug 09 '22 11:08 pared

@pared But I think the main problem is the current method:

  1. Create a git repo and run dvc init
  2. Copy the log files into the git repo
  3. dvc add those files and dvc commit to generate the .dvc files, then dvc push to transfer the files to the remote
  4. git commit the generated .dvc files, and git tag to add a timestamp (or version)
  5. To save local space, remove all the log files and leave only the .dvc files

First, I need to create a git repo before I can use dvc, right? Where should I put this git repo? I don't want to put it on my local machine, because the local machine has too little space. I want to put it on the data machine.

Second, I need to copy files into that git repo so that I can do dvc commit, and I also need to do git commit to track the file version, right? If I have already put the git repo on the data machine, then before any data has been dvc tracked, I need to send the files to the git repo on the data machine. There is no dvc command for this procedure, so, as you say, I need to write a script to do the task and do the commit, right?

Third, if I want to save space, I need to remove the files in that git repo and leave only the .dvc files, right? Or I can use symlinks, so that the real files reside in the local cache, but if I want to use that cache as a dvc remote, I need to pick another cache folder, right? Here, I need some manual setup again.

I think dvc's idea is great, but in many production environments data behaves like a stream, flowing into a data lake or data warehouse. If we want to use dvc to replace our current commercial data management tool, it has many inconveniences due to the need for a git repo. But if dvc could do the above out of the box, without a git repo and custom scripting, I think it could be very competitive with commercial data management tools.

allenyllee avatar Aug 09 '22 13:08 allenyllee

Ah, got it, so the problem here is that the data is large, and copying it between different machines does not make sense.

First, I need to create a git repo before I can use dvc, right? Where should I put this git repo? I don't want to put it on my local machine, because the local machine has too little space. I want to put it on the data machine.

Among the existing features, external outputs and dependencies might be helpful here. In this case, you only have a git repo and the .dvc files locally, so you can avoid copying the data. Here is some info:

  1. External dependencies: https://dvc.org/doc/user-guide/external-dependencies#external-dependencies
  2. External outputs: https://dvc.org/doc/user-guide/managing-external-data#managing-external-data

Please note that this is an advanced use case. I recommend creating a test project to play around with it and understand what is going on.

Second, I need to copy files into that git repo so that I can do dvc commit, and I also need to do git commit to track the file version, right?

True, in the aforementioned external use case you would be using dvc commands but providing paths to the data on your data machine. In that case, if one of the users wants to experiment, they need to do dvc checkout. It will check out the path on your remote, so you need to remember that only one person at a time should be accessing this path until they finish downloading it to their own computer. This is why this is considered an advanced use case.

Third, if I want to save space, I need to remove the files in that git repo and leave only the .dvc files, right? Or I can use symlinks, so that the real files reside in the local cache, but if I want to use that cache as a dvc remote, I need to pick another cache folder, right? Here, I need some manual setup again.

True

Please ping us if you consider using the external use case; it seems it might be helpful here.

pared avatar Aug 09 '22 14:08 pared

@allenyllee @pared mentioned the --to-remote option. Would that work for you, and if not, why? I'm still trying to better understand your needs and your suggestion, tbh. Could you also clarify: are the 'data machine' and the 'remote' the same in your case (some self-hosted box), or is the remote some cloud storage, like S3, with a separate ELT instance where all logs are collected?

shcheklein avatar Aug 12 '22 23:08 shcheklein

@shcheklein Sorry, I haven't tested it yet. But I saw this question in the dvc forum: https://discuss.dvc.org/t/large-data-registry-on-nas-with-multiple-dvc-and-non-dvc-users/1294

I think his problem is similar to ours, and I think what I proposed can solve his problem as well.

allenyllee avatar Aug 13 '22 02:08 allenyllee

@allenyllee yep, and what I'm trying to understand is what exactly is missing and how it differs from the proposal I made in that thread. It would be really helpful if you could try it and let us know if something is missing.

shcheklein avatar Aug 13 '22 02:08 shcheklein