GTO for large repos
A question by @dberenbaum
Since GTO may be especially useful for large repos:

- Do you anticipate any problems using submodules, sparse checkout, etc.?
- Will tags get too noisy?
I haven't thought about submodules, sparse checkouts, etc. yet. As for noisy tags, the biggest number of tags I've seen so far is in https://github.com/jupyterlab/jupyterlab. I've also seen a dozen repos with more than 5 tags, so people do use tags, though not that often.
Another problem with that many tags/commits could be speed. I tried parsing the git tags in the jupyterlab repo, and that seems to work quite fast. But the current GTO architecture also requires reading artifacts.yaml from each commit, which I haven't tried yet. Checking how slow that is should be the next step.
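To get a feel for where that cost comes from, here is a minimal sketch of reading artifacts.yaml at every commit via the plain `git` CLI (a hypothetical helper for illustration, not GTO's actual code):

```python
import subprocess

def read_file_at_commits(repo_path, path="artifacts.yaml"):
    """Return {commit_sha: file_contents} for every commit where `path` exists."""
    shas = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    contents = {}
    for sha in shas:
        # One `git show` per commit: this is the O(number-of-commits) cost
        # that may get slow on repos the size of jupyterlab.
        result = subprocess.run(
            ["git", "-C", repo_path, "show", f"{sha}:{path}"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:  # the file exists at this commit
            contents[sha] = result.stdout
    return contents
```

Spawning one `git show` per commit is the naive approach; batching (e.g. `git cat-file --batch`) or an in-process git library would likely be needed at jupyterlab scale.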
Besides performance, my concern would be that if users want to use tags the regular way alongside gto, it will pollute the tag list with tons of gto tags and make the regular tags hard to use.
Agreed. Another question is how you manage this if you have some global release cycle, like creating git tags named v1.2.3, e.g. if you release a python package from the repo.
How do you see gto causing problems for typical release tags? Are you worried about conflicts or something else?
One thought: using some special character to start the tag, like {[email protected]}, could help reduce noise by naturally grouping all the gto tags together.
> How do you see gto causing problems for typical release tags? Are you worried about conflicts or something else?
I'm afraid this may lead to false-positive starts of a global release process. E.g. you create a git tag to promote a model, but the CI job that releases your package to PyPI triggers on the new tag and tries to make a release.
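One way to guard against that is a sketch like the following, assuming the release pipeline can filter the pushed tag before publishing (the tag formats used here are assumptions about a possible scheme, not GTO's defined format):

```python
import re

# Only plain semver-style tags (v1.2.3) should start a package release;
# promotion tags like "model@v1.2.3" or "model#prod" must not.
RELEASE_TAG = re.compile(r"^v\d+\.\d+\.\d+$")

def should_release(tag: str) -> bool:
    return bool(RELEASE_TAG.match(tag))
```

Here `should_release("v1.2.3")` is `True`, while `should_release("model@v1.2.3")` is `False`. Most CI systems can express the same filter declaratively (e.g. a tag pattern on the workflow trigger), which avoids even starting the job.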
I recall running some experiments to optimize this, but the main problem was git tag operations (or the alternatives) running too slow. Maybe after moving to scmrepo this actually works faster.
Closing this, we moved to scmrepo and to dvc.yaml. Let's hit the scale issue first here.