Is it possible to get the necessary git information without storing the entire repository?
@GregSutcliffe has a hypothesis, to be tested further, that cloning the entire repository might not be necessary, with the potential of saving a lot of storage space
git clone --bare may be a path forward on disk space, and may perhaps protect you against history rewrites as well
More investigation is necessary to see if this is a viable path
@sgoggins would you be able to assign this issue to @GregSutcliffe ?
@cdolfi : I invited @GregSutcliffe to the Augur repository, which GitHub makes me do before assigning him stuff.
I've accepted. I did a bit of work looking into bare clones a while back, which seems like the way forward, but it needs confirming & testing.
Hi @GregSutcliffe,
Thanks for taking this on! I’d love to contribute to this effort as well. Could you share any insights or progress you’ve made so far regarding the investigation into bare clones? This would help me get up to speed and start exploring ways to extend or build upon your work.
Looking forward to collaborating on this! cc: @sgoggins @cdolfi @ElizabethN
@cdolfi what information is needed? i have done some random things with regard to checking out parts of a repo before, notably by cloning a repo but filtering out objects. this would preserve the file structure (tree) and all the commits, but largely remove all the file contents.
example: git clone --filter=blob:none --no-checkout https://android.googlesource.com/platform/manifest
afaik a bare clone is essentially just the contents of .git, which i believe still stores the full repository's worth of objects (ignoring LFS/submodules) and may only save a little space compared to the filtered method above
other strategies (which, unlike the above, still produce fully-usable repositories) include:
- shallow clone: https://git-scm.com/docs/git-clone#Documentation/git-clone.txt-code--depthltdepthgtcode (clones only the most recent N commits, could be useful if a recent snapshot of the code itself needs analyzing while avoiding extra overhead)
- --shallow-since and the related options immediately after the above in the manpage, for doing it by date; this could be useful for only cloning a repo's commits that occurred since the date of a previous collection (see the example after this list)
- there should also be an option to limit it by branches i think
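for illustration, these might look something like the following (the depth and date here are just placeholder values, reusing the same example repo as above, not anything Augur currently does):
git clone --depth 1 https://android.googlesource.com/platform/manifest manifest-shallow
git clone --shallow-since=2024-01-01 https://android.googlesource.com/platform/manifest manifest-since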
Assuming these repos are only kept around for gathering data and then deleted (otherwise we would become a low-fi Software Heritage clone), i feel like these options cover a pretty good range of tradeoffs to cover different use cases
@MoralCode you're not wrong, but you're not going far enough :)
If you combine a --bare clone with other options, you can get a lot further. Notably, we only care about the git history, not the files themselves, and we can filter those out using --filter=blob:none. You can also go further and try adding --single-branch, since we only look at the default branch anyway.
Here are some test results on ansible/ansible (a decently sized repo):
$ git clone https://github.com/ansible/ansible ansible-full
$ git clone --bare https://github.com/ansible/ansible ansible-bare
$ git clone --bare --filter=blob:none https://github.com/ansible/ansible ansible-filter
$ git clone --bare --filter=blob:none --single-branch https://github.com/ansible/ansible ansible-filter-single
$ du -sh *
301M ansible-full
263M ansible-bare
96M ansible-filter
71M ansible-filter-single
So yes, --bare isn't a lot - just 13% - but the full effect is 74% with all the options. And git log appears to work fine:
$ cd ansible-filter-single && git log --oneline
df08ed3ef3 🔥 Remove Python 2 datetime compat fallbacks
50b4e0d279 facts: use pagesize for darwin (#84779)
fc71a5befd Added Popen warning to lines.py documentation (#84806)
7fbaf6cfcf dnf5: fix is_installed check for provided packages (#84802)
7e0d8398ff Implement an informative reporter for `resolvelib`
Assuming these repos are only kept around for gathering data and then deleted (otherwise we would become a low-fi Software Heritage clone), i feel like these options cover a pretty good range of tradeoffs to cover different use cases
To this point - no, the data is not deleted by Augur, and it actually gets pretty unhappy if you delete a repo it thinks it cloned already. Our instance is currently holding 2.5T of git data, so you can see why I want to explore reducing that.
The OpenSSF Scorecard, repo_labor code counter, and dependency analysis tasks rely on the cloned repository existing. The consumption of disk space is a scaling issue, and I think the answer will be more complicated than it first appears. This would be an issue to address as we make Augur cloud native. #1389 has a punch list for these issues that I will comment on.
should we keep this open while it is still an unresolved issue? it also seemed like there were some viable strategies to reduce the space of cloned repos (with some analysis needed to see whether each of the various metrics actually requires all git objects and/or a full working directory checkout to be present)
I would say this is "possible"; however, the cost of disk space is minuscule compared to the engineering time it would take to refactor this. We can leave this open, but I would estimate it at months of engineering time because of how many different parts of Augur presently rely on the clone ...
This would be an issue to address as we make Augur cloud native.
sgoggins added the wontfix label on Mar 18
Is this still planned for a future cloud-native push?
I still think there's some benefit here: yes, it would take a nontrivial chunk of engineering time to audit all the places in the code that touch this, but IMO it could be done in a way that really helps large instances keep storage costs manageable as they scale up.
Greg essentially proposed 3 ways to reduce the space of each repo: --bare, removing objects (AKA --filter=blob:none), and --single-branch. Given that several processes ("OpenSSF Scorecard, repo_labor code counter, and dependency analysis tasks") all rely on the repo existing (and implicitly on having a worktree from that repo in some form), removing objects is basically a non-starter.
Of the two remaining ways (bare clone and single-branch mode), I think there could be value in organizing the processes that require a worktree so that they create that worktree in a temporary directory from the bare repo (which would have been fetched at collection time).
This way the disk storing bulk repos in Augur sees the savings, while processes that need to check out a worktree for analysis of repo contents still have a way to do so (see the sketch below). I think this could be done as some kind of shared component at the Augur level, so each task that needs a worktree can reuse that function, which in turn would ensure a new temporary worktree is only created if one doesn't already exist or doesn't match the intended checkout hash (allowing multiple tasks to share a worktree). Should disk space be needed, there could be a mechanism to delete any of these temporary worktrees that aren't currently in use; with the bare repos stored elsewhere, this deletion would not lose any data that isn't already kept elsewhere in Augur.
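As a rough sketch of the underlying git plumbing (the paths are hypothetical, the commit hash is just the one from the log output above, and this assumes the bare clone keeps full objects, since blob filtering was ruled out earlier):
$ # materialize a throwaway worktree from the bare clone for a task that needs files on disk
$ git -C /data/augur/repos/ansible.git worktree add /tmp/augur-worktrees/ansible-df08ed3 df08ed3ef3
$ # ... run scorecard / repo_labor / dependency analysis against /tmp/augur-worktrees/ansible-df08ed3 ...
$ # remove it once no task needs it any more
$ git -C /data/augur/repos/ansible.git worktree remove /tmp/augur-worktrees/ansible-df08ed3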
--single-branch would need a deeper look and would likely only make sense if all of the analyses being run only look at the default branch.
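If a deeper look shows that some analysis does need other branches, my understanding is that a single-branch bare clone can later be widened by adjusting the remote's fetch refspec and re-fetching, something like (path is hypothetical, refspec shown for a bare repo):
$ git -C /data/augur/repos/ansible.git config remote.origin.fetch '+refs/heads/*:refs/heads/*'
$ git -C /data/augur/repos/ansible.git fetch origin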