Reduce size of openforcefield git checkouts by clearing out large deleted files from git history

Open jchodera opened this issue 7 years ago • 13 comments

Checking out openforcefield from github now pulls down over 200 MB in over 5000 files. What happened here?

jchodera avatar May 28 '17 02:05 jchodera

I don't see anything "sudden" which is big; from what I can tell from a cursory look it is a bunch of "modest" sized molecule sets and other test sets (MiniDrugBank, a ZINC subset, etc.) we use to examine coverage, plus the AlkEthOH set and other related data. Most of these are single compressed files containing multiple molecules; I'm not seeing anything obviously bizarre (like 4000 mol2 files for individual molecules or some such). I don't have time to dig more at the moment.

davidlmobley avatar May 28 '17 04:05 davidlmobley

It looks like there's currently ~20M of examples/ and ~100M of utilities/:

lski1962:openforcefield choderaj$ du -sh *
4.0K	LICENSE
8.0K	README.md
 28K	The-SMIRNOFF-force-field-format.md
 28K	devtools
 20M	examples
8.0K	oe_license.txt.enc
 39M	openforcefield
 44K	openforcefield.egg-info
4.0K	rdkit
4.0K	setup.py
 96M	utilities

Lots of these are in utilities/filter_molecule_sets:

lski1962:filter_molecule_sets choderaj$ du -sh *
 28M	DrugBank.sdf
3.1M	DrugBank_CHO_atyped.mol2
 19M	DrugBank_atyped.oeb
2.7M	DrugBank_updated_ff.mol2.gz
2.6M	DrugBank_updated_tripos.mol2.gz
1.5M	MiniDrugBank_ff.mol2
1.1M	MiniDrugBank_ff_withGenerics.mol2
1.5M	MiniDrugBank_tripos.mol2
1.1M	MiniDrugBank_tripos_withGenerics.mol2
8.0K	README.md
4.0K	elements_exclude.txt
2.7M	ff_test.mol2.gz
 12K	filter_molecule_sets.py
 16K	pickMolecules.ipynb
4.0K	remove_smirks_CHO.smarts
4.0K	remove_smirks_simple.smarts
2.6M	tripos_test.mol2.gz
3.9M	updated_DrugBank.mol2.gz

What if we moved some of these larger groups of molecule sets to external repos that we can import for testing?

I'm just concerned that someone who wants to grab the openforcefield toolkit may not need 200M of files just to assign SMIRNOFF parameters.

jchodera avatar May 28 '17 04:05 jchodera
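For anyone who wants to repeat the size survey above without `du`, here is a minimal Python sketch that lists the largest files in a working tree. The function name `largest_files` and its `top` parameter are hypothetical, chosen only for illustration. It reports apparent file sizes, so the numbers may differ slightly from the disk-usage figures `du` prints.

```python
import os

def largest_files(root, top=10):
    """Return up to `top` (size_bytes, relative_path) pairs under root, largest first."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own object store so only working-tree files are counted.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks; their targets are counted where they live
            sizes.append((os.path.getsize(path), os.path.relpath(path, root)))
    return sorted(sizes, reverse=True)[:top]
```

Calling `largest_files("utilities", top=20)` from the repository root would give roughly the same picture as the listing above.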

Alternatively, I suppose we could just be parsimonious about what we actually install/package inside of conda packages.

jchodera avatar May 28 '17 04:05 jchodera

It might be me: I think one of the notebooks on my branch in /examples/forcefield_modification/ was quite big. I have now shrunk it.

hjuinj avatar May 28 '17 10:05 hjuinj

Seems like the right way to deal with this is to not package/install things which are not needed. Someone who doesn't want to do more than assign parameters doesn't need all the molecule sets which are for testing/development/etc., for example.

davidlmobley avatar May 28 '17 16:05 davidlmobley

My most recent pull request removed most of the files in filter_molecule_sets if that helps.

bannanc avatar Jun 01 '17 18:06 bannanc

I think we might have to clear out those files from the git history: https://stackoverflow.com/questions/2100907/how-to-remove-delete-a-large-file-from-commit-history-in-git-repository

jchodera avatar Jun 01 '17 22:06 jchodera
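Deleting files in a new commit does not shrink a clone, because the old blobs stay reachable from earlier commits. Before rewriting anything, it helps to see which blobs in the history are actually large. The sketch below pipes `git rev-list --objects --all` into `git cat-file --batch-check` and keeps blobs above a threshold. The function names and the 1 MB default are assumptions made for illustration only.

```python
import subprocess

def parse_batch_check(output, min_bytes=1_000_000):
    """Parse 'type sha size path' lines; return (size, path, sha) for large blobs, largest first."""
    big = []
    for line in output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # commits and trees are not file contents
        size = int(parts[2])
        path = parts[3] if len(parts) == 4 else ""
        if size >= min_bytes:
            big.append((size, path, parts[1]))
    return sorted(big, reverse=True)

def large_blobs_in_history(repo_dir=".", min_bytes=1_000_000):
    """List large blobs reachable from any ref, including files deleted long ago."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    checked = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo_dir, input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(checked, min_bytes)

# Example (run inside a clone):
#   for size, path, sha in large_blobs_in_history("."):
#       print(f"{size / 1e6:8.1f} MB  {path}  {sha[:10]}")
```

Any path that shows up here but no longer exists in the working tree is a candidate for removal from history.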

@t-kimber and I recently used BFG Repo-Cleaner (mentioned in the SO question) with great success. It runs on Java, but it's easy to use.

jaimergp avatar Oct 18 '19 15:10 jaimergp
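As a rough sketch of what a BFG run could look like, the snippet below only builds and prints the command sequence; it executes nothing. The jar filename `bfg.jar`, the mirror directory name, and the `10M` threshold are assumptions to adapt, and the helper name `bfg_cleanup_commands` is hypothetical. The final push is deliberately left out, since it rewrites history for everyone with an existing clone or fork.

```python
import shlex

def bfg_cleanup_commands(repo_url, mirror_dir, bfg_jar="bfg.jar", max_blob="10M"):
    """Return (working_dir, argv) steps for a BFG cleanup; nothing is executed here."""
    return [
        # Work on a bare mirror clone so every branch and tag gets rewritten.
        (".", ["git", "clone", "--mirror", repo_url, mirror_dir]),
        # Drop every blob larger than max_blob from the history.
        (".", ["java", "-jar", bfg_jar, "--strip-blobs-bigger-than", max_blob, mirror_dir]),
        # Expire reflogs and garbage-collect so the stripped blobs are really gone.
        (mirror_dir, ["git", "reflog", "expire", "--expire=now", "--all"]),
        (mirror_dir, ["git", "gc", "--prune=now", "--aggressive"]),
    ]

steps = bfg_cleanup_commands(
    "https://github.com/openforcefield/openff-toolkit.git",  # repo discussed in this thread
    "openff-toolkit.git",                                    # assumed local mirror directory
)
for cwd, argv in steps:
    print(f"(in {cwd}) $ {shlex.join(argv)}")
```

By default BFG leaves the contents of the latest commit untouched, so large files that are still present need to be deleted in an ordinary commit first.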

I've used that before too.

davidlmobley avatar Oct 18 '19 16:10 davidlmobley

This remains a minor nuisance when needing to pip install git+git://github.com/openforcefield/openff-toolkit.git@param-iter since it clones the repo with all of its history and pip does not support shallow cloning. It takes about 2 minutes on my mediocre residential internet, but still about 20-30 seconds on CI machines that probably have 1-10 gigabit connections. Not the slowest step in CI builds but it adds up when all workflows need to run it.

mattwthompson avatar Aug 17 '21 14:08 mattwthompson

You can do pip install https://github.com/openforcefield/openff-toolkit/archive/param-iter.tar.gz I think!

jaimergp avatar Aug 17 '21 14:08 jaimergp

Oh, hey, that works great! I had assumed that GitHub didn't make/have archives for all branches (i.e. feature branches that are not tagged) but I was wrong. That drops the install time to just a few seconds.

mattwthompson avatar Aug 17 '21 15:08 mattwthompson

AFAIK the "filename" can be any git ref, so even commit hashes will work.

jaimergp avatar Aug 17 '21 15:08 jaimergp
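The archive trick above can be wrapped in a small helper for CI scripts. This is a minimal sketch that follows the URL pattern quoted in the thread; the function name `github_archive_url` is hypothetical, and it assumes the ref (branch, tag, or commit hash) needs no URL escaping.

```python
def github_archive_url(owner, repo, ref, ext="tar.gz"):
    """Build a GitHub archive URL for a branch, tag, or commit hash."""
    return f"https://github.com/{owner}/{repo}/archive/{ref}.{ext}"

url = github_archive_url("openforcefield", "openff-toolkit", "param-iter")
# pip downloads only this snapshot, not the full git history.
print(f"pip install {url}")
```

Because the archive contains a single snapshot, the download size no longer depends on how much history the repository carries.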