Large files are accidentally Git-tracked?
Look for __ptxcache__ files (.o and .ptx files specifically), per @braxtoncuneo.
You can use du -sh * in a directory for a human-readable list of how large each item in the directory is.
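For example (standard coreutils; sort -rh just puts the biggest entries first):

```sh
# Human-readable size of each item in the current directory, largest first
du -sh * | sort -rh

# du -sh * skips dotfiles, so check the .git directory separately -- that's where history lives
du -sh .git/*
```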
The large files all seem to be due to the inf_shem361 examples, specifically the answer.h5 and data .npz files.
Possible ways to handle that:
- zip all of the inf_shem361 examples and keep them where they are
- move the inf_shem361 to a separate repo to use as tests when desired
- remove the inf_shem361 examples/tests entirely
The plan is to replace the infinite medium 361-group problem with an infinite medium few-group problem (probably the 7 group c5g7 data).
The largest disk usage seems to come from:

```
 68M    .git/objects/b3
142M    .git/objects/pack
```
Now I'm less sure whether the ~4 MB 361-group data is actually the culprit. I'll try to use https://rtyley.github.io/bfg-repo-cleaner/, which may give us more info.
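For reference, the usual BFG workflow (a sketch based on its docs; the size threshold here is just a placeholder) is roughly:

```sh
# Work on a bare --mirror clone so that every ref gets rewritten
git clone --mirror https://github.com/CEMeNT-PSAAP/MCDC.git

# Strip every blob over the threshold from history (files in HEAD are protected by default)
java -jar bfg.jar --strip-blobs-bigger-than 50M MCDC.git

# Expire the old reflog entries, garbage-collect, then check the size
cd MCDC.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
du -sh objects
```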
So...
```
Deleted files
-------------
Filename                           | Git id
-----------------------------------------------------------
Miniconda3-latest-Linux-ppc64le.sh | cdb26f99 (94.9 MB)
analytic.zip                       | b3859ac8 (92.5 MB)
```
Now the .git/objects folder is 44M. More reasonable!
However, the next step per the BFG instructions is:

> Finally, once you're happy with the updated state of your repo, push it back up (note that because your clone command used the --mirror flag, this push will update all refs on your remote server):
>
> $ git push
>
> At this point, you're ready for everyone to ditch their old copies of the repo and do fresh clones of the nice, new pristine data. It's best to delete all old clones, as they'll have dirty history that you don't want to risk pushing back into your newly cleaned repo.
Any thoughts? @clemekay @jpmorgan98
We may be able to reduce the size further when we remove the SHEM361 test problems and examples. I'll rerun the repo cleaner. Nevertheless, we still need to think about the final step of the cleaning I mentioned in the previous comment.
Currently looking into whether we need to use the cleanup function or whether we can just delete these files.
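If by "the cleanup function" we mean the post-filter garbage-collection step from the BFG docs, that part is just plain git run inside the mirror clone (a sketch):

```sh
# Drop reflog entries that still reference the old, pre-cleaning objects,
# then garbage-collect so those objects are actually removed from disk
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```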
I'll bring this up at this week's dev meeting, but essentially this will require everyone re-cloning the repo once it's cleaned.
Adding this write-up here for myself, really –
A commit is essentially a snapshot of the entire repository at that point in time (not just a diff against the previous commit). When cloning a repository, you're downloading a copy of the repository as it currently exists as well as the entire commit history.
Say the repository contains example.file, added in a commit I'll call commit-1, with the contents:
This is an example file.
I delete the file and make a commit, which I'll call commit-2.
When I clone the repository, its contents do not include example.file (commit-2 deleted it). However, the clone also brings down the full commit history, which still includes commit-1, and commit-1's under-the-hood content is not just "Added example.file"; its snapshot still points at the full blob:

This is an example file.
So, even if a large file is added in one commit and then deleted in the next, the commit history of the repository effectively contains a full copy of the file.
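A quick way to see this locally (just a throwaway sketch; demo and big.bin are made-up names):

```sh
# Commit a ~50 MB file, then delete it in the very next commit
git init demo && cd demo
dd if=/dev/urandom of=big.bin bs=1M count=50
git add big.bin && git commit -m "add big.bin"
git rm big.bin && git commit -m "delete big.bin"

# The working tree no longer has big.bin, but the object database still does
git count-objects -vH                          # still roughly 50 MB of objects
git rev-list --objects --all | grep big.bin    # the blob is still reachable through history
```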
Going to use git-filter-repo, as it only requires a single Python file and has more functionality than bfg-repo-cleaner.
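For reference, the two levels of filtering discussed below would look roughly like this with git-filter-repo's documented options (a sketch; the size threshold is illustrative):

```sh
# git-filter-repo wants to run on a fresh clone
git clone --mirror https://github.com/CEMeNT-PSAAP/MCDC.git
cd MCDC.git

# Conservative: drop just one file from all of history
git filter-repo --invert-paths --path Miniconda3-latest-Linux-ppc64le.sh

# Aggressive: drop every blob over a size threshold
git filter-repo --strip-blobs-bigger-than 10M
```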
Just removing Miniconda3-latest-Linux-ppc64le.sh gets us down to ~50 MB. However, I can be more aggressive with removal and get us down to ~8 MB. No matter how aggressive I am with it:
- Every developer will need to re-clone the repo because its history will change
- The current state of the repo will not change at all
Is there a level of aggressiveness we do or don't want here? It's obviously useful to have the entire true history of the repo available, but it also seems very useful to clear out the history of files that haven't existed since 2022.
Thoughts? @ilhamv @jpmorgan98
Thanks, @clemekay .
How much will the commit history change? Will it just remove some commits associated with the deleted files, or will it completely reset the history (starting from a freshly initialized repo)?
@ilhamv Tl;dr: we will keep our current repository and all of the relevant history, but the entire commit history will have new hashes
Under the hood, filtering is essentially re-writing the entire repository history to remove any reference to the filtered files but preserve everything else. It writes new commits, trees, tags, and blobs (that's all git lingo, even blobs) corresponding to (but filtered from) the original objects in the repository, then deletes the original history and leaves only the new.
We get to keep our same repository with all of its branches, tags, version releases, etc., with their original dates, and they are all automatically updated! For example, if an existing release refers to a specific commit by hash, the release will be updated to refer to that commit's new hash. The only exception is that if a commit is made entirely empty by what's been deleted, filtering will delete the now-obsolete commit.
We also keep our existing issues and PRs, but I don't think those get automatically updated if they refer to a specific commit hash. However, we will have a table of old hashes and their corresponding new hash, if we end up needing it.
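For what it's worth, my understanding from the git-filter-repo docs is that it writes that mapping into the filtered clone itself:

```sh
# One "old-hash new-hash" pair per line, written after filtering finishes
head .git/filter-repo/commit-map
```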
To show how everything will be unaffected, I did a test:
- I made an exact copy of the MCDC repository called mcdc-filter-clone
- I made a couple releases that match our most recent ones, including generating changelogs (just like our releases have)
- I made my own fork of it (clemekay/mcdc-filter-clone), made a small change, and submitted/approved a PR from my fork to CEMeNT-PSAAP/mcdc-filter-clone
- As a reference of what would happen, I also submitted another PR that I left open and created an issue
- Then, I used git-filter-repo to remove anything larger than 10MB from CEMeNT-PSAAP/mcdc-filter-clone; this gets the repository down to about 33 MB (from our current 145 MB)
- You can see from the screenshot below that when I push the filtered repository back up to CEMeNT-PSAAP/mcdc-filter-clone, the branches and version tags are updated with the new history, but the PRs are left untouched:
You can see in the CEMeNT repo here that things all look functionally the same except for the PR I left open, which has now been automatically closed.
You can also see that my fork is now out of date, because even though it's still functionally the same, the commits all now have different hashes (which is also why the now-closed PR appears to have 1370 commits):
I'll write up some instructions for everyone on how to best update their repositories for different scenarios (local branch, remote branch, uncommitted work, committed work, pushed work, etc) and can help with whatever else comes up!
Thanks for the thorough explanation, @clemekay! Those sound and look good to me.
You mentioned a more aggressive filtering would get us down to ~8 MB. This repo filtering seems to be something that we don't want to do very often, so trying to get the most out of this effort would be great.
That's a good point, I'll get on that and explain it to everyone at today's meeting!
Waiting on #318 & #316
Repository filtered from ~150 MB to ~30 MB. For instructions on how to get your fork/clone up-to-date, see https://github.com/CEMeNT-PSAAP/MCDC/wiki/Update-all-clones-due-to-git%E2%80%90filter%E2%80%90repo