dvc
`git gc` and (many) queued DVC experiments
We have a report of the following error in git garbage collection (presumably caused by running a lot of DVC experiments in a queue) from one of our customers:
error: The last gc run reported the following. Please correct the root cause and remove gc.log.
Automatic cleanup will not be performed until the file is removed.
warning: There are too many unreachable loose objects; run 'git prune' to remove them.
Git version is 2.34.1, DVC version is 3.33.3
They are using a Python script which runs a subprocess with bash commands to generate those experiments, but the script does nothing special: it just queues experiments with different parameter sets for grid-search hyperparameter tuning.
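For context, a script like the one described might look roughly like the sketch below. The parameter names and values are made up for illustration, and it calls `dvc exp run --queue` directly through `subprocess` rather than through bash, which may differ from what the customer actually does.

```python
import itertools
import subprocess

# Hypothetical parameter grid; these names and values are placeholders,
# not the customer's actual parameters.
PARAM_GRID = {
    "train.lr": [0.001, 0.01],
    "train.batch_size": [32, 64],
}


def build_queue_commands(grid):
    """Return one `dvc exp run --queue` command per parameter combination."""
    names = list(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            cmd += ["--set-param", f"{name}={value}"]
        commands.append(cmd)
    return commands


def queue_all(grid, dry_run=True):
    """Queue every combination; with dry_run=True only print the commands."""
    for cmd in build_queue_commands(grid):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    queue_all(PARAM_GRID)
```

With a large grid this creates many experiments without ever invoking CLI Git, which is the situation discussed in this issue.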
My impression is that this might have something to do with git gc perhaps not being run automatically while all of these experiments are created, so it has to be run manually, as suggested in the git gc documentation:
Running git gc manually should only be needed when adding objects to a repository without regularly running such porcelain commands, to do a one-off repository optimization, or e.g. to clean up a suboptimal mass-import. See the "PACKFILE OPTIMIZATION" section in git-fast-import[1] for more details on the import case.
My initial guess is that simply running git gc manually is enough to resolve this without any unwanted side-effects, but I will try to do this first.
Could you please provide more information?
Is the above error message from dvc? If it is, could you please also include verbose output from dvc?
@skshetry No, this is an output from git
FYI, so far I have been unable to reproduce the error; requesting more info from the customer.
> My impression is that this might have something to do with `git gc` perhaps not being run automatically while all of these experiments are created and so it has to be run manually as suggested in the `git gc` documentation
>
> My initial guess is that simply running `git gc` manually is enough to resolve this without any unwanted side-effects, but I will try to do this first.
This is correct. git gc is not run when DVC does Git operations for experiments. This is normally not an issue because we expect the user to eventually use a CLI Git command at some point as a part of their normal workflow (at which time CLI Git does git gc --auto to automatically gc whatever is needed).
git prune is also run as a part of git gc, but in this case the warning is due to this CLI Git behavior:
If the file gc.log exists, then git gc --auto will print its content and exit with status zero instead of running unless that file is more than gc.logExpiry old. Default is "1.day". See gc.pruneExpire for more ways to specify its value.
(see https://git-scm.com/docs/git-gc#Documentation/git-gc.txt-gclogExpiry)
If the user removes gc.log (I think it's probably located in .git/gc.log) the automatic git gc should work normally without printing any further warnings, but the user can also just wait a day for the log file to expire (assuming they have not modified the gc.logExpiry Git configuration option).
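As a rough illustration of the above, here is a small sketch that checks whether a `gc.log` is still blocking `git gc --auto`. It assumes the file lives at `.git/gc.log` (as guessed above), that `.git` is a plain directory, that the default `gc.logExpiry` of one day applies, and that the file's age is judged by its modification time. The function names are hypothetical helpers, not part of Git or DVC.

```python
import time
from pathlib import Path

# Mirrors the documented default gc.logExpiry of "1.day".
ONE_DAY_SECONDS = 24 * 60 * 60


def gc_log_status(repo_root, expiry_seconds=ONE_DAY_SECONDS, now=None):
    """Return 'absent', 'blocking', or 'expired' for <repo_root>/.git/gc.log."""
    log_path = Path(repo_root) / ".git" / "gc.log"
    if not log_path.is_file():
        return "absent"
    now = time.time() if now is None else now
    # Assumption: age is measured from the file's modification time.
    age = now - log_path.stat().st_mtime
    return "expired" if age > expiry_seconds else "blocking"


def remove_gc_log(repo_root):
    """Delete .git/gc.log if present; return True when a file was removed."""
    log_path = Path(repo_root) / ".git" / "gc.log"
    if log_path.is_file():
        log_path.unlink()
        return True
    return False
```

If `gc_log_status(".")` returns `"blocking"`, calling `remove_gc_log(".")` corresponds to the manual removal described above; `"expired"` means simply waiting was enough.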
In the event that the user runs into this again, it is safe to manually run git prune as the message suggests (or you can manually run git gc), but I don't think it's actually needed. (I think the suggested fix is more for the case where you are running git gc manually)
My understanding here is that what is happening is:
- user runs a lot of DVC experiments (which given enough experiments will end up generating a lot of loose git objects)
- user finally runs a CLI Git command
  - `git gc --auto` is run automatically by CLI Git
  - `git gc` sees an unexpected state (more loose objects than it normally expects) and generates the log file message to notify the user about that state, but `git gc --auto` will still end up running `git prune` as usual
- user runs another CLI Git command
  - `git gc --auto` is run automatically by CLI Git
  - `git gc --auto` sees the log file generated by the prior command and the log file has not yet expired, so it prints the contents of the log file and then exits
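To see whether a repository is in the state described in the steps above, one could count its loose objects. The sketch below walks `.git/objects` directly and compares the total with the documented default `gc.auto` value of 6700. This is only an approximation: it is a full count, not necessarily how Git itself decides, it does not distinguish reachable from unreachable objects, and `git count-objects -v` remains the authoritative tool. The helper names are hypothetical.

```python
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")
DEFAULT_GC_AUTO = 6700  # documented default for the gc.auto config option


def count_loose_objects(repo_root):
    """Count loose object files under <repo_root>/.git/objects/<2 hex chars>/."""
    objects_dir = Path(repo_root) / ".git" / "objects"
    if not objects_dir.is_dir():
        return 0
    count = 0
    for fanout in objects_dir.iterdir():
        # Loose objects live in two-character hex directories; skip pack/ and info/.
        if fanout.is_dir() and len(fanout.name) == 2 and set(fanout.name) <= HEX_DIGITS:
            count += sum(1 for entry in fanout.iterdir() if entry.is_file())
    return count


def exceeds_gc_auto(loose_count, threshold=DEFAULT_GC_AUTO):
    """Return True when the loose object count is above the gc.auto threshold."""
    return loose_count > threshold
```

For example, `exceeds_gc_auto(count_loose_objects("."))` gives a quick hint as to whether the next CLI Git command is likely to trigger `git gc --auto`.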
I don't think there is anything for us to do here. As Peter mentioned, it's okay to run `git gc`/`git prune`, and that is the correct fix/suggestion. Closing.