GTO for large repos

Open aguschin opened this issue 3 years ago • 4 comments

A question by @dberenbaum

Since GTO may be especially useful for large repos:

- Do you anticipate any problems using submodules, sparse checkout, etc.?
- Will tags get too noisy?

I haven't thought about submodules, sparse checkouts, etc. yet. As for noisy tags, the largest number of tags I've seen so far is in https://github.com/jupyterlab/jupyterlab. I've seen a dozen repos with more than 5 tags, so people do use them, though not that often.

Another problem with that many tags/commits could be speed. I tried parsing git tags in the JupyterLab repo, and it seems to work quite fast. But the current GTO architecture requires reading artifacts.yaml from each commit, and I haven't tried anything like that yet. The next step should be to check how slow that is going to be.

aguschin avatar Mar 23 '22 10:03 aguschin

Besides performance, my concern is that if users want to use tags the regular way alongside gto, gto will pollute the tag list with tons of its own tags and make the regular tags hard to use.

dberenbaum avatar Mar 23 '22 14:03 dberenbaum

Agreed. Another question is how you manage this if you have a global release cycle, such as creating git tags named like v1.2.3, e.g. if you release a Python package from the repo.

aguschin avatar Mar 25 '22 06:03 aguschin

How do you see gto causing problems for typical release tags? Are you worried about conflicts or something else?

One thought: using some special character to start the tag, like {[email protected]}, could help reduce noise by naturally grouping all the gto tags together.
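A rough sketch of how a leading marker would let tooling separate the two kinds of tags. The marker `@`, the helper `split_tags`, and the example tag names are all placeholders for illustration, not an actual gto naming scheme:

```python
def split_tags(tags, marker="@"):
    """Partition tag names into (marked, regular) lists, each sorted.

    `marker` is a hypothetical leading character reserved for
    artifact tags; regular release tags never start with it.
    """
    marked = sorted(t for t in tags if t.startswith(marker))
    regular = sorted(t for t in tags if not t.startswith(marker))
    return marked, regular


if __name__ == "__main__":
    # Made-up tag names, mixed the way they might appear in one repo.
    tags = ["v1.2.3", "@churn-v0.0.1", "v1.2.4", "@churn-v0.0.2"]
    marked, regular = split_tags(tags)
    print("regular:", regular)
    print("marked: ", marked)
```

The same grouping is available from git directly, since `git tag --list` accepts a glob pattern, e.g. `git tag --list '@*'`.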

dberenbaum avatar Mar 25 '22 14:03 dberenbaum

> How do you see gto causing problems for typical release tags? Are you worried about conflicts or something else?

I'm afraid this may trigger the global release process by mistake. E.g. you create a git tag to promote a model, but a CI job that releases your package to PyPI starts at the same time and tries to make that release.
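One possible guard, as a minimal sketch: have the release job check the tag name against a strict pattern before doing anything. This assumes package releases are tagged exactly like v1.2.3 and that model tags never match that shape on their own; `should_release` and the example tag `churn@v1.2.3` are hypothetical names for illustration.

```python
import re

# Assumption: package releases are tagged exactly "v<major>.<minor>.<patch>".
RELEASE_TAG = re.compile(r"v\d+\.\d+\.\d+")


def should_release(tag):
    """Return True only when the whole tag name is a plain release version."""
    return RELEASE_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    # "churn@v1.2.3" stands in for a model promotion tag (made-up name).
    for tag in ["v1.2.3", "churn@v1.2.3", "v1.2"]:
        print(tag, should_release(tag))
```

Using `fullmatch` rather than `search` matters here: a tag that merely contains a version string would otherwise still start the release.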

aguschin avatar Mar 28 '22 05:03 aguschin

I recall running some experiments to optimize this, but the main problem was that git tags (or alternatives) ran too slowly. Maybe after moving to scmrepo this actually works faster.

aguschin avatar Aug 30 '23 14:08 aguschin

Closing this, we moved to scmrepo and to dvc.yaml. Let's hit the scale issue first here.

shcheklein avatar Nov 04 '23 14:11 shcheklein