storage location for golden histograms and logs
We have a functional solution right now, but it has a few issues.
- We have a hard upper limit on the size of our golden files (100MB I believe?).
- The histogram files are binary, so each new version is committed into the git tree as a completely new object. This increases the size of the git repo and slows clones down considerably.
In a perfect world, neither of these issues would be present. The golden files would reside on the machine(s) that run the CI and we would only interact with them directly if something with the CI was going wrong. A normal developer would not need to copy them down to their machine and the only size limit on these golden files would be dictated by the disk space of the machine(s) that run the CI.
This issue is just me documenting some ideas while I'm thinking about it. I doubt that this is high enough priority to resolve anytime soon.
Self-Hosted Runners
The ideal solution is for us to administer the machines that run the CI, with the obvious downside that we would need to bear the additional burden of administration. Nevertheless, I still think this is a reasonable solution since we already have experience administering clusters for large-scale simulation production. In fact, the machine that builds the ldmx/dev images is already a self-hosted runner at UMN.
System Requirements
- Pre-installed Dependencies (apptainer, just, denv, other helpers...)
- Sharing golden files across machines either via a shared filesystem or manual duplication
- Access to update the golden files from the runners themselves (update-pr-gold job)
- 16+ cores (2 cores per job, allows the PR Validation jobs to be run in parallel)
- 1TiB+ (~64GiB per job is plenty if jobs are sharing image cache and sharing the golden files)
- 32GiB+ (2GiB per job)
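As a rough check on these numbers, the per-job figures can be turned into a count of jobs that fit on one machine at once. This is only a sketch: it assumes the 1TiB+ figure refers to disk and the 32GiB+ figure to memory, and the function name max_parallel_jobs is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: how many PR Validation jobs fit on one runner at the same time.
# Assumes 2 cores, 2 GiB memory, and ~64 GiB disk per job (figures from the list above).
# max_parallel_jobs is a hypothetical helper, not part of any existing tooling.
max_parallel_jobs() {
  local cores=$1 mem_gib=$2 disk_gib=$3
  local by_cores=$(( cores / 2 ))
  local by_mem=$(( mem_gib / 2 ))
  local by_disk=$(( disk_gib / 64 ))

  # The scarcest resource sets the limit.
  local jobs=$by_cores
  if (( by_mem < jobs )); then jobs=$by_mem; fi
  if (( by_disk < jobs )); then jobs=$by_disk; fi
  echo "$jobs"
}

# The minimum machine described above: 16 cores, 32 GiB memory, 1 TiB disk.
max_parallel_jobs 16 32 1024   # prints 8
```

With the minimum requirements, cores are the limiting resource, so memory and disk have some headroom for the shared image cache and golden files.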
Additional Features if Possible
- Local image cache to save time on jobs
- docker install for building and pushing the production images
- Storage of full golden event files for precise comparison (not just histogram KS test but starting with a full event-by-event diff)
- Local repo cache to save time on jobs (have GitHub's actions/checkout do the checkout logic for us but not remove afterwards? not sure if possible...)
Separate Repo for Golden Files
The only other idea I can think of is putting the golden files into their own git repository which we then only push/pull from in CI when necessary. This would then only waste time moving the large git objects in the CI and not for the normal human developer.
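A minimal sketch of what the CI side of this could look like, assuming the golden files live in their own repository. The helper name fetch_golden, its arguments, and the placeholder URL are hypothetical. Only standard git commands are used.

```shell
#!/usr/bin/env bash
# Sketch: fetch golden files from a separate repository only when CI needs them.
# fetch_golden and its arguments are hypothetical names, not existing ldmx-sw tooling.
fetch_golden() {
  local url=$1 dest=$2
  if [ -d "$dest/.git" ]; then
    # Reuse an existing checkout (for example on a self-hosted runner) and update it.
    git -C "$dest" pull --ff-only --quiet
  else
    # --depth 1 skips old versions of the binary histograms, keeping the transfer small.
    git clone --quiet --depth 1 "$url" "$dest"
  fi
}

# Example call from a CI job (the URL is a placeholder for wherever the files would live):
#   fetch_golden "<golden-repo-url>" "$RUNNER_TEMP/golden"
```

Because only the CI calls this, a developer cloning the main repository never downloads the large objects.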
I think having an ldmx-data repository would be great. I could also imagine keeping the LHE files for the signal there, so we don't need to rerun MG all the time, even though that does not change very often (it changes every 5 years or so when we update the version? lol). I also really wanted some tests to run on test beam data, and the same storage question applies there.
We should also think about whether we want to remove the old logs / golds from the commit history.
Pruning the gold histograms and logs is pretty easy with git-filter-repo.
tom@appa:~/code/ldmx$ git clone git@github.com:LDMX-Software/ldmx-sw.git ldmx-sw-fresh
Cloning into 'ldmx-sw-fresh'...
remote: Enumerating objects: 59352, done.
remote: Counting objects: 100% (1675/1675), done.
remote: Compressing objects: 100% (1065/1065), done.
remote: Total 59352 (delta 1161), reused 623 (delta 609), pack-reused 57677 (from 3)
Receiving objects: 100% (59352/59352), 491.68 MiB | 15.19 MiB/s, done.
Resolving deltas: 100% (35935/35935), done.
Updating files: 100% (1163/1163), done.
tom@appa:~/code/ldmx$ git -C ldmx-sw-fresh filter-repo --invert-paths --path-glob '**/gold.root' --path-glob '**/gold.log'
NOTICE: Removing 'origin' remote; see 'Why is my origin removed?'
in the manual if you want to push back there.
(was git@github.com:LDMX-Software/ldmx-sw.git)
Parsed 8913 commits
New history written in 2.41 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at cacb75b66 add doxygen tag to Python.cxx
Enumerating objects: 58378, done.
Counting objects: 100% (58378/58378), done.
Delta compression using up to 4 threads
Compressing objects: 100% (17487/17487), done.
Writing objects: 100% (58378/58378), done.
Total 58378 (delta 35561), reused 58080 (delta 35269), pack-reused 0
Completely finished after 5.40 seconds.
tom@appa:~/code/ldmx$ du -sh ldmx-sw/.git
913M ldmx-sw/.git
tom@appa:~/code/ldmx$ du -sh ldmx-sw-fresh/.git
232M ldmx-sw-fresh/.git
That's a decrease of almost 4x !!!
The "Why is my origin removed?" section of the git-filter-repo manual is a good read for us and may help us decide on next steps for removing the golden files from the history.
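Before deciding what else to prune, it may help to see which files dominate the history. The following is a sketch using only plain git commands. The helper name largest_blobs is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list the largest blobs anywhere in a repository's history.
# largest_blobs is a hypothetical helper; the git commands themselves are standard.
largest_blobs() {
  local repo=$1 count=${2:-10}
  # rev-list prints "<object id> <path>" for every object reachable from any ref;
  # cat-file then reports each object's type and size, with %(rest) carrying the path.
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -k1,1 -n -r |
    head -n "$count"
}

# Example: show the 20 largest files (size in bytes, then path) in a clone.
#   largest_blobs ldmx-sw 20
```

Every stored version of a file shows up as its own line, so a frequently updated binary such as a gold.root would be expected to appear many times.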
really nice!!
Resolved by #1721; discussion of pruning the history has moved to #1726.