
Reduce .git size

Open hogoww opened this issue 3 years ago • 7 comments

By executing git gc --aggressive --prune on the pharo-vm repository, we go from 241 to 113 MB of data.
AFAIK, this command is lossless.
The only way I found to do that, however, is to remove the repo and recreate it, which sucks for many reasons.

What do you think @guillep @tesonep ?
This would speed up clones tremendously already.
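
For reference, this is roughly what I ran locally (a sketch; sizes are from my machine, and the exact prune argument may vary):

    git clone https://github.com/pharo-project/pharo-vm.git
    cd pharo-vm
    du -sh .git                   # ~241 MB before

    # full repack, recomputing deltas, and drop unreachable loose objects
    git gc --aggressive --prune
    du -sh .git                   # ~113 MB after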

hogoww avatar Dec 30 '22 18:12 hogoww

By executing git gc --aggressive --prune on the pharo-vm repository, we go from 241 to 113 MB of data. AFAIK, this command is lossless.

I'm interested, I think it's a good idea

The only way I found to do that, however, is to remove the repo and recreate it, which sucks for many reasons.

I don't grasp the implications of this, could you elaborate? Wouldn't a force push do it? Is there some documentation?

Thanks!

guillep avatar Jan 03 '23 08:01 guillep

The only way I found to do that, however, is to remove the repo and recreate it, which sucks for many reasons.

I don't grasp the implications of this, could you elaborate?

From my understanding, you'd have to recreate the repo from scratch, i.e. delete and repush. You'd lose everything GitHub-related (issues, PRs, others?), which sounds very annoying.

Wouldn't a force push do it? Is there some documentation?

That's actually why I'm asking: the documentation I found looks unrelated to me, and I hope your expertise might help.

From what I understand, GitHub is just a remote repository with a pretty interface. You give it a well-formed .git directory, and it builds that pretty interface on top. However, you cannot run commands on the remote .git. Moreover, what you push are commits, not the .git directory itself. Trying to push after an aggressive GC says that nothing has changed. My understanding is that git computes the delta between the local repository and the remote one and only rewrites on the remote what is needed, therefore leaving the remote .git untouched.
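
Concretely, what I tried looks like this (output paraphrased):

    # the local object store shrinks...
    git gc --aggressive --prune

    # ...but there is nothing new to send to the remote
    git push
    # -> "Everything up-to-date"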

Am I making sense? ^^"

hogoww avatar Jan 03 '23 09:01 hogoww

The GitHub documentation (https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository) points to the BFG tool to remove sensitive stuff from a repository's history. Therefore, it seems possible to force-push a rewritten history onto GitHub. Given what I read there, and as you said too, maybe a force push would do it.

GitHub Docs
If you commit sensitive data, such as a password or SSH key into a Git repository, you can remove it from the history. To entirely remove unwanted files from a repository's history you can use either the git filter-repo tool or the BFG Repo-Cleaner open source tool.
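
For reference, both tools rewrite history (so the commit SHA-1s change) and the result then has to be force-pushed; usage is roughly this (placeholder file names, not something I ran on pharo-vm):

    # git filter-repo: drop a file from the entire history
    git filter-repo --invert-paths --path path/to/unwanted-file

    # BFG: same idea, run against a fresh mirror clone
    java -jar bfg.jar --delete-files unwanted-file pharo-vm.git

    # the rewritten history then needs a force push
    git push --force --all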

hogoww avatar Jan 12 '23 07:01 hogoww

GitHub is responsible for the storage representation of the repository. Since it's closed source, it is difficult to understand how and when they optimize the space. Moreover, the git clone operation is quite complex (in fact, a clone operation is simple: it is a fetch into an empty repository; it is the fetch that is complex :p), and therefore I'm not even sure that cloning an optimized repository will create an optimized repository on the client side.

Eons ago, they used to gc the stuff regularly https://twitter.com/githubhelp/status/387926738161774592?lang=en (but without the aggressive flag?)

I think that "removing data" and force pushing won't work here, since, if I understand correctly, the goal is not to rewrite the history but to get a better packing of the git objects (commits, blobs, trees, etc.), where the SHA-1s remain unchanged (a lossless operation).
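
To check whether the objects actually end up better packed on the client side, something like this can be used (a sketch; the repack line is roughly what gc --aggressive does under the hood, window/depth values illustrative):

    # pack statistics; size-pack is the total packfile size in KiB
    git count-objects -v

    # full repack that recomputes all deltas
    git repack -a -d -f --window=250 --depth=50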

While having a slimmed default clone is nice, for the use case where one really wants a fast and small clone, --depth=, --shallow-since and other related options could be a more generic and efficient approach.
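
For example (URL and date illustrative):

    # only the latest commit of the default branch
    git clone --depth=1 https://github.com/pharo-project/pharo-vm.git

    # or everything since a given date
    git clone --shallow-since=2022-01-01 https://github.com/pharo-project/pharo-vm.git

    # a blobless partial clone keeps the full history but fetches file contents lazily
    git clone --filter=blob:none https://github.com/pharo-project/pharo-vm.git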

privat avatar Jan 16 '23 16:01 privat

I contacted GitHub support; I'll see what they answer, when they answer. In the meantime, thanks @privat, that's indeed a very good point. I will modify my repo accordingly, particularly for CI purposes.

hogoww avatar Jan 17 '23 10:01 hogoww

Hi, I got a very nice answer from support; I'll paraphrase it here. Basically, if you want such commands to be run on the server side, you should ask them directly. They don't do it by default, to avoid any weird behavior. They suggest using git-sizer (blog post) to analyze a given repository.

They already ran a cache clearance on the repository, which reclaimed 20k out of 160k git objects and freed a few megabytes. I can try investigating a bit further with git-sizer to see what we could ask them for. As I am no longer part of this repository though, I can only work out what should be asked for, and you will have to ask for it. Does that sound agreeable to you?
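
Roughly, the analysis would just be (on a full, non-shallow clone; git-sizer needs to be installed separately):

    # prints the biggest blobs, trees, history depth, etc. and flags anything unusual
    git-sizer --verbose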

hogoww avatar Jan 18 '23 04:01 hogoww

While having a slimmed default clone is nice, for the use case where one really wants a fast and small clone, --depth=, --shallow-since and other related options could be a more generic and efficient approach.

Tried it, though my CI is down for now. This does not work, because Iceberg expects the full history.
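
For completeness, a shallow clone can be completed afterwards, but that just defers the download, so it doesn't really help here:

    # fetch the remaining history of a shallow clone
    git fetch --unshallow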

hogoww avatar Jan 22 '23 15:01 hogoww