
Is it possible to get the necessary git information without storing the entire repository

Open cdolfi opened this issue 1 year ago • 10 comments

@GregSutcliffe has a hypothesis, still to be tested, that cloning the entire repository might not be necessary, which could save a lot of storage space

git clone --bare may be a path forward on disk space, and may also protect you against history rewrites

More investigation is necessary to see if this is a viable path

@sgoggins would you be able to assign this issue to @GregSutcliffe ?

cdolfi avatar Dec 04 '24 16:12 cdolfi

@cdolfi : I invited @GregSutcliffe to the Augur repository, which GitHub makes me do before assigning him stuff.

sgoggins avatar Jan 08 '25 15:01 sgoggins

I've accepted. I did a bit of work looking into bare clones a while back, which seems like the way forward, but it needs confirming & testing.

GregSutcliffe avatar Jan 09 '25 15:01 GregSutcliffe

Hi @GregSutcliffe,

Thanks for taking this on! I’d love to contribute to this effort as well. Could you share any insights or progress you’ve made so far regarding the investigation into bare clones? This would help me get up to speed and start exploring ways to extend or build upon your work.

Looking forward to collaborating on this! cc: @sgoggins @cdolfi @ElizabethN

musaqlain avatar Mar 08 '25 20:03 musaqlain

@cdolfi what information is needed? I have done some experimenting with checking out parts of a repo before, notably by cloning a repo but filtering out objects. This preserves the file structure (tree) and all the commits, but largely removes the file contents.

example: git clone --filter=blob:none --no-checkout https://android.googlesource.com/platform/manifest
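
To make the tradeoff concrete, here is a self-contained sketch of the same idea against a throwaway local repo (the repo, paths, and commit messages are purely illustrative; the file:// URL forces git to use the real fetch protocol, which blob filtering requires):

```shell
#!/bin/sh
# Sketch: a blob-filtered clone keeps the full commit history while
# omitting file contents until they are actually needed.
set -e
tmp=$(mktemp -d)

# Build a small source repo with two commits.
git init -q "$tmp/src"
echo one > "$tmp/src/a.txt"
git -C "$tmp/src" add a.txt
git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.com commit -qm "first"
echo two > "$tmp/src/b.txt"
git -C "$tmp/src" add b.txt
git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.com commit -qm "second"

# The "server" side must allow filtered fetches and lazy object requests.
git -C "$tmp/src" config uploadpack.allowFilter true
git -C "$tmp/src" config uploadpack.allowAnySHA1InWant true

# Clone with all blobs filtered out and no working-tree checkout.
git clone -q --filter=blob:none --no-checkout "file://$tmp/src" "$tmp/partial"

# The full history is still available for analysis...
git -C "$tmp/partial" rev-list --count HEAD   # prints 2

# ...and populating the working tree lazily fetches the missing blobs.
git -C "$tmp/partial" reset -q --hard HEAD
cat "$tmp/partial/b.txt"                      # prints "two"
```

Note the lazy fetch at the end: a filtered clone only saves space as long as nothing forces a checkout, at which point the blobs are fetched from the remote on demand.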

MoralCode avatar Mar 09 '25 02:03 MoralCode

AFAIK a bare clone is essentially just the contents of .git, which I believe still stores the full repository's worth of objects (ignoring LFS/submodules), and may only save a little space compared to the filtered method above

Other strategies (which produce not-fully-usable repositories) include:

  • shallow clone: https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code--depthltdepthgtcode (clones only the most recent N commits; could be useful if a recent snapshot of the code itself needs analyzing while avoiding extra overhead)
  • --shallow-since, and the related --shallow-exclude (the options immediately after --depth in the manpage), for limiting by date; this could be useful for only cloning the commits made since the date of a previous collection
  • --single-branch (optionally combined with --branch) to limit the clone to a single branch
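
The options above can be sketched against a throwaway local repo (everything here is illustrative; the flags work the same against a remote URL, and the file:// form forces the real fetch protocol, which shallow clones require):

```shell
#!/bin/sh
# Sketch of the shallow-clone variants listed above.
set -e
tmp=$(mktemp -d)

# A small source repo with three commits.
git init -q "$tmp/src"
for i in 1 2 3; do
  echo "$i" > "$tmp/src/file.txt"
  git -C "$tmp/src" add file.txt
  git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.com \
      commit -qm "commit $i"
done

# Only the most recent commit: good for current-snapshot analysis.
git clone -q --depth 1 "file://$tmp/src" "$tmp/shallow"
git -C "$tmp/shallow" rev-list --count HEAD    # prints 1

# By date instead of count: e.g. only history since a previous collection.
git clone -q --shallow-since="yesterday" "file://$tmp/src" "$tmp/since"

# Full history, but only the default branch.
git clone -q --single-branch "file://$tmp/src" "$tmp/single"
```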

Assuming these repos are only kept around for gathering data and then deleted (otherwise we would become a low-fi Software Heritage clone), I feel like these options cover a pretty good range of tradeoffs for different use cases

MoralCode avatar Mar 09 '25 02:03 MoralCode

@MoralCode you're not wrong, but you're not going far enough :)

If you combine a --bare clone with other options, you can get a lot further. Notably, we only care about the git history, not the files themselves, and we can filter them out using --filter=blob:none. You can also go further and try adding --single-branch, since we only look at the default branch anyway.

Here's some test results on ansible/ansible (a decent size repo)

$ git clone https://github.com/ansible/ansible ansible-full
$ git clone --bare https://github.com/ansible/ansible ansible-bare
$ git clone --bare --filter=blob:none https://github.com/ansible/ansible ansible-filter
$ git clone --bare --filter=blob:none --single-branch https://github.com/ansible/ansible ansible-filter-single

$ du -sh *

301M    ansible-full
263M    ansible-bare
96M     ansible-filter
71M     ansible-filter-single

So yes, --bare alone isn't a lot - just 13% - but the full effect is 74% with all the options. And git log appears to work fine:

$ cd ansible-filter-single && git log --oneline
df08ed3ef3 🔥 Remove Python 2 datetime compat fallbacks
50b4e0d279 facts: use pagesize for darwin (#84779)
fc71a5befd Added Popen warning to lines.py documentation (#84806)
7fbaf6cfcf dnf5: fix is_installed check for provided packages (#84802)
7e0d8398ff Implement an informative reporter for `resolvelib`

GregSutcliffe avatar Mar 13 '25 16:03 GregSutcliffe

Assuming these repos are only kept around for gathering data and then deleted (otherwise we would become a low-fi Software Heritage clone), I feel like these options cover a pretty good range of tradeoffs for different use cases

To this point - no, the data is not deleted by Augur, and it actually gets pretty unhappy if you delete a repo it thinks it cloned already. Our instance is currently holding 2.5T of git data, so you can see why I want to explore reducing that.

GregSutcliffe avatar Mar 13 '25 16:03 GregSutcliffe

The OpenSSF Scorecard, repo_labor code counter, and dependency analysis tasks rely on the cloned repository existing. The consumption of disk space is a scaling issue. I think the answer will be more complicated than we might think. This would be an issue to address as we make Augur cloud native. #1389 Has a punch list for these issues that I will comment on.

sgoggins avatar Mar 18 '25 14:03 sgoggins

Should we keep this open while it is still unresolved? It also seemed like there were some viable strategies to reduce the space of cloned repos (with some analysis needed to see whether each of the various metrics actually requires all git objects and/or a full working-directory checkout to be present)

MoralCode avatar Mar 18 '25 15:03 MoralCode

I would say this is "possible", however the cost of disk space is minuscule compared to the engineering time it would take to refactor this. We can leave this open, but I would estimate it at months of engineer time because of how many different parts of Augur presently rely on the clone ...

sgoggins avatar Mar 19 '25 00:03 sgoggins

This would be an issue to address as we make Augur cloud native.

sgoggins added wontfix on Mar 18

Is this still planned for a future cloud-native push?

I still think there's some benefit here: yes, it would take a nontrivial chunk of engineering time to audit all the places in the code that touch the clone, but IMO it could be done in a way that really helps large instances keep storage costs manageable as they scale up.

Greg essentially proposed three ways to reduce the space of each repo: --bare, removing objects (i.e. --filter=blob:none), and --single-branch. Given that several processes ("OpenSSF Scorecard, repo_labor code counter, and dependency analysis tasks") rely on the repo existing (and implicitly on having a worktree from that repo in some form), removing objects is basically a non-starter.

Of the two remaining options (bare clone and single-branch mode), I think there could be value in reorganizing the processes that require a worktree so that they create one in a temporary directory from the bare repo (which would have been fetched at collection time).

This way the disk storing bulk repos in Augur sees the savings, but processes that need to analyze repo contents can still check out a worktree. This could be implemented as a shared component at the Augur level, so that each task needing a worktree reuses the same function, which in turn would ensure a new temporary worktree is only created if one doesn't already exist or doesn't match the intended checkout hash (allowing multiple tasks to share a worktree). Should disk space be needed, there could be a mechanism to delete any temporary worktrees not currently in use; with the bare repos kept on disk, this deletion would not lose any data that isn't stored elsewhere in Augur.
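
As a sketch of that flow (all paths and repo names here are illustrative, and this assumes git's built-in git worktree command rather than anything Augur currently implements):

```shell
#!/bin/sh
# Sketch: keep only a bare clone on the data disk, and materialize a
# temporary worktree from it whenever a task needs actual file contents.
set -e
tmp=$(mktemp -d)

# Stand-in for the upstream repository.
git init -q "$tmp/upstream"
echo data > "$tmp/upstream/data.txt"
git -C "$tmp/upstream" add data.txt
git -C "$tmp/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -qm "initial"

# What would be kept long-term: a bare clone (no working tree).
git clone -q --bare "$tmp/upstream" "$tmp/store/repo.git"

# What a worktree-needing task would do: check out a temporary worktree
# at the commit it wants, run its analysis, then clean up.
mkdir -p "$tmp/scratch"
git -C "$tmp/store/repo.git" worktree add "$tmp/scratch/wt" HEAD
cat "$tmp/scratch/wt/data.txt"   # file contents are available here

# When disk space is needed, the worktree can be discarded; the bare
# clone still holds all the git data.
git -C "$tmp/store/repo.git" worktree remove --force "$tmp/scratch/wt"
```

Because `git worktree` tracks its checkouts in the bare repo, the "delete unused worktrees" step could simply enumerate them and remove any that no task currently holds.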

--single-branch would need a deeper look and would likely only make sense if all of the analyses being run only look at the default branch.

MoralCode avatar Jul 02 '25 15:07 MoralCode