
`git gc` and (many) queued DVC experiments

Open tibor-mach opened this issue 1 year ago • 5 comments

We have a report from one of our customers of the following error in git garbage collection (presumably caused by running a lot of dvc experiments in a queue):

error: The last gc run reported the following. Please correct the root cause and remove gc.log.
Automatic cleanup will not be performed until the file is removed.
warning: There are too many unreachable loose objects; run 'git prune' to remove them. 

Git version is 2.34.1, DVC version is 3.33.3

They are using a Python script that runs bash commands in a subprocess to generate those experiments, but the script does nothing special: it just queues experiments with different parameter sets for grid-search hyperparameter tuning.
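For context, a queueing script of the kind described might look roughly like the sketch below. This is not the customer's script: the parameter names (`train.lr`, `train.batch_size`) and their values are hypothetical, and the helper names are made up for illustration. It relies only on `dvc exp run --queue` with `--set-param`.

```python
import itertools
import subprocess

# Hypothetical grid; the real parameter names and values are not in the report.
GRID = {
    "train.lr": [0.001, 0.01, 0.1],
    "train.batch_size": [32, 64],
}


def build_queue_commands(grid):
    """Return one `dvc exp run --queue` command per parameter combination."""
    names = list(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        cmd = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            cmd += ["--set-param", f"{name}={value}"]
        commands.append(cmd)
    return commands


def queue_all(grid):
    """Queue every combination; needs DVC installed and a DVC project in the cwd."""
    for cmd in build_queue_commands(grid):
        subprocess.run(cmd, check=True)
```

Calling `queue_all(GRID)` inside a DVC project would queue six experiments here. A larger grid multiplies that number quickly, which is how a repository can accumulate many Git objects without any CLI Git command being run in between.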

My impression is that this might have something to do with git gc perhaps not being run automatically while all of these experiments are created and so it has to be run manually as suggested in the git gc documentation

Running git gc manually should only be needed when adding objects to a repository without regularly running such porcelain commands, to do a one-off repository optimization, or e.g. to clean up a suboptimal mass-import. See the "PACKFILE OPTIMIZATION" section in git-fast-import[1] for more details on the import case.

My initial guess is that simply running git gc manually is enough to resolve this without any unwanted side-effects, but I will try to do this first.

tibor-mach, Feb 07 '24 14:02

Could you please provide more information?

Is that above error message from dvc? If it is, could you please also include verbose output from dvc?

skshetry, Feb 07 '24 14:02

@skshetry No, this is an output from git

tibor-mach, Feb 07 '24 14:02

FYI, so far I have been unable to reproduce the error. Requesting more info from the customer.

tibor-mach, Feb 08 '24 14:02

My impression is that this might have something to do with git gc perhaps not being run automatically while all of these experiments are created and so it has to be run manually as suggested in the git gc documentation

My initial guess is that simply running git gc manually is enough to resolve this without any unwanted side-effects, but I will try to do this first.

This is correct. git gc is not run when DVC does Git operations for experiments. This is normally not an issue because we expect the user to eventually run a CLI Git command as part of their normal workflow (at which point CLI Git runs git gc --auto to automatically gc whatever is needed).
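To see whether loose objects are piling up between CLI Git invocations, one option is to read `git count-objects -v`. The sketch below is only an illustration: the helper names are hypothetical, and comparing the result against 6700 assumes the documented default of `gc.auto` has not been changed.

```python
import subprocess


def parse_count_objects(output):
    """Parse `git count-objects -v` output into a dict of integer fields."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value.strip())
    return stats


def loose_object_count(repo_dir="."):
    """Return the number of loose objects Git reports for repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "count-objects", "-v"],
        check=True,
        capture_output=True,
        text=True,
    )
    # The "count" field is the number of loose objects.
    return parse_count_objects(result.stdout)["count"]
```

Running `loose_object_count()` before and after a large batch of queued experiments should show the count growing if nothing triggers garbage collection in between.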

git prune is also run as a part of git gc, but in this case the warning is due to this CLI Git behavior:

If the file gc.log exists, then git gc --auto will print its content and exit with status zero instead of running unless that file is more than gc.logExpiry old. Default is "1.day". See gc.pruneExpire for more ways to specify its value.

(see https://git-scm.com/docs/git-gc#Documentation/git-gc.txt-gclogExpiry)

If the user removes gc.log (I think it's probably located in .git/gc.log) the automatic git gc should work normally without printing any further warnings, but the user can also just wait a day for the log file to expire (assuming they have not modified the gc.logExpiry Git configuration option).
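A minimal sketch of checking for and removing the log file follows. It assumes the file lives at `.git/gc.log` (as guessed above, which may not hold for linked worktrees) and that `gc.logExpiry` is still at its default of one day. The function names are hypothetical.

```python
import os
import time

# Assumes the documented default gc.logExpiry of "1.day".
ONE_DAY_SECONDS = 24 * 60 * 60


def gc_log_status(git_dir=".git", expiry_seconds=ONE_DAY_SECONDS, now=None):
    """Return None if gc.log is absent, else (path, age_seconds, still_blocking)."""
    path = os.path.join(git_dir, "gc.log")
    if not os.path.exists(path):
        return None
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    # While the file is younger than the expiry, `git gc --auto` prints it and exits.
    return path, age, age < expiry_seconds


def remove_gc_log(git_dir=".git"):
    """Delete gc.log if present; return True when a file was removed."""
    path = os.path.join(git_dir, "gc.log")
    try:
        os.remove(path)
    except FileNotFoundError:
        return False
    return True
```

It is worth reading the contents of the file before deleting it, since it records what the last gc run reported.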

pmrowla, Feb 09 '24 01:02

In the event that the user runs into this again, it is safe to manually run git prune as the message suggests (or you can manually run git gc), but I don't think it's actually needed. (I think the suggested fix is more for the case where you are running git gc manually)
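If the cleanup is to be triggered from the same kind of Python script that queues the experiments, a thin wrapper such as the one below could be used. This is a sketch only: `manual_cleanup` and its parameters are hypothetical names, and it simply shells out to `git gc` or `git prune`.

```python
import subprocess


def manual_cleanup(repo_dir=".", use_prune_only=False, run=subprocess.run):
    """Run `git gc` (or only `git prune`) in repo_dir and return the command used."""
    cmd = ["git", "-C", repo_dir, "prune" if use_prune_only else "gc"]
    # check=True raises CalledProcessError if Git exits with a non-zero status.
    run(cmd, check=True)
    return cmd
```

The `run` parameter exists only so the function can be exercised without touching a real repository. Note that Git only prunes unreachable objects older than `gc.pruneExpire` (two weeks by default), so very recent unreachable objects may remain after a manual run.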

My understanding here is that what is happening is:

  1. user runs a lot of DVC experiments (which given enough experiments will end up generating a lot of loose git objects)
  2. user finally runs a CLI Git command
    1. git gc --auto is run automatically by CLI Git
    2. git gc sees an unexpected state (more loose objects than it normally expects) and generates the log file message to notify the user about that state, but git gc --auto will still end up running git prune as usual
  3. user runs another CLI Git command
    1. git gc --auto is run automatically by CLI Git
    2. git gc --auto sees the log file generated by the prior command and the log file has not yet expired, so it prints the contents of the log file and then exits
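The sequence above can be sketched as a toy model. This is not Git's implementation: `RepoState`, `gc_auto`, and the `prunable` parameter are invented for illustration, and the defaults assume the documented values `gc.auto` = 6700 and `gc.logExpiry` = 1 day.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepoState:
    loose_objects: int
    gc_log_written_at: Optional[float] = None  # None means gc.log does not exist


def gc_auto(state, now, threshold=6700, log_expiry=86400.0, prunable=0):
    """Toy model of one `git gc --auto` invocation; returns what happened."""
    log_at = state.gc_log_written_at
    if log_at is not None and now - log_at < log_expiry:
        # Step 3.2: an unexpired gc.log makes gc --auto print it and exit.
        return "printed gc.log and exited"
    if state.loose_objects <= threshold:
        return "nothing to do"
    # Step 2.2: gc still runs and removes whatever is prunable.
    state.loose_objects -= prunable
    if state.loose_objects > threshold:
        # Too many loose objects remain, so the warning ends up in gc.log.
        state.gc_log_written_at = now
        return "ran gc, wrote gc.log"
    state.gc_log_written_at = None
    return "ran gc"
```

In this model, a first call with many loose objects writes the log, a second call within a day only prints it, and a call after the expiry runs gc again.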

pmrowla, Feb 09 '24 01:02

I don't think there is anything for us to do here. As Peter mentioned, it's okay to run git gc/git prune, and that is the correct fix/suggestion. Closing.

skshetry, Mar 25 '24 10:03