
Support caching repositories

Open japborst opened this issue 2 years ago • 6 comments

Hello!

When using multi-gitter I noticed that on every run the respective repos are always pulled.

It would be great if this could be cached, to avoid long wait times to pull many repositories (especially when the entire org is specified).

japborst avatar Feb 21 '22 10:02 japborst

I think it would be useful if it doesn't create too many problems for the user. How do you imagine this working? 😄 Should the user set a cache timeout themselves, and, if they enable caching, expect errors such as merge conflicts that they have to deal with manually?

lindell avatar Feb 23 '22 07:02 lindell

@lindell admittedly I didn't think deeply about this yet, but a first version could implement an algorithm such as the following, given a $CACHE_ROOT directory (I'm assuming GitHub terminology):

  • If $CACHE_ROOT/$org/$repo doesn't exist: check out as usual.
  • If $CACHE_ROOT/$org/$repo does exist, execute a number of commands to get it into a pristine state:
    git clean -fdx
    git fetch --depth=[the-configured-fetch-depth]
    git remote prune origin
    git remote set-head origin -a
    git checkout [the-configured-base-branch]
    git reset --hard origin/[the-configured-base-branch]
    # ^ If not configured, could run the semantic equivalent of e.g.:
    #     git symbolic-ref refs/remotes/origin/HEAD \
    #       | sed "s,^refs/remotes/origin/,," \
    #       | xargs git checkout
    #     git reset --hard refs/remotes/origin/HEAD 
    git submodule update --recursive # If `multi-gitter` currently handles submodules; didn't check.
    
    (I'm no Git guru, so perhaps there's a more straightforward way to reset the repository to a pristine state, containing the n most recent commits on the configured target branch, but the overall gist would be the same: (a) re-use already-downloaded data, (b) update to match the most recent remote state, (c) clear any local modifications.)
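The steps above could be sketched as a small shell helper. To be clear, the function name, argument order, and clone invocation here are all made up for illustration; a real implementation inside multi-gitter would use its configured fetch depth, base branch, and authentication, and would also need the submodule handling mentioned above:

```shell
#!/bin/sh
# Hypothetical sketch of the cache-or-reset logic described above.
# update_cached_repo CACHE_ROOT REMOTE_URL ORG REPO BASE_BRANCH
update_cached_repo() {
  cache_root="$1"; remote="$2"; org="$3"; repo="$4"; base_branch="$5"
  cache_dir="$cache_root/$org/$repo"

  if [ ! -d "$cache_dir/.git" ]; then
    # Cache miss: clone as multi-gitter does today
    # (a real implementation would pass the configured --depth).
    git clone -q "$remote" "$cache_dir"
    return
  fi

  # Cache hit: bring the working copy back to a pristine checkout
  # of the base branch, discarding any local modifications.
  git -C "$cache_dir" clean -qfdx
  git -C "$cache_dir" fetch -q origin
  git -C "$cache_dir" remote prune origin
  git -C "$cache_dir" checkout -q "$base_branch"
  git -C "$cache_dir" reset -q --hard "origin/$base_branch"
}
```

The first call populates `$CACHE_ROOT/$org/$repo`; subsequent calls only fetch new objects and reset, which is where the bandwidth saving comes from.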

I suppose there should also be a --trust-cached-repositories flag (better name TBD) so that, during rapid prototyping, the user can iterate on the script passed to multi-gitter run without incurring any I/O overhead.

Stephan202 avatar Feb 23 '22 08:02 Stephan202

@Stephan202 So in that case, multi-gitter would still need to fetch from the remote. I guess this could speed up the process in some cases with very big repos and small changes 🤔 For those use cases it would indeed be useful.

lindell avatar Feb 23 '22 09:02 lindell

Indeed, we have a number of large repos that would benefit from this.

(Currently we have a repository containing all our other repositories as submodules, with various operations performed using git submodule foreach. This can be a bit unwieldy, but does have the benefit of repository state updates being decoupled from modification operations, which avoids extensive waiting between trials, even when on a slow network.)

Stephan202 avatar Feb 23 '22 09:02 Stephan202

To give a little more flavour to the size of the problem: in our case (and, I imagine, at many other companies) running multi-gitter against the entire GitHub org means cloning hundreds of repos. Even using the default depth of 1, that still means fetching anywhere from a few MB up to, in the worst case, a GB.

japborst avatar Feb 24 '22 10:02 japborst

I do agree that this is something that should be added! I will not have the time to look at this any time soon, but if you add it and create a PR, I'm happy to merge it 🙂

lindell avatar Mar 01 '22 07:03 lindell