Removing old test-files from git-history

Open rtroilo opened this issue 4 years ago • 2 comments

We have some very old files, mostly test data, in our git history, which have let our repository grow to its current size of 120 MB. Files like the following could be wiped from the history to reduce the repository size:

  • 46b7725b3c4a 1,5MiB core/oshdb/src/test/resources/data/hosmdb_keytables.mv.db
  • 74bc041433cc 1,5MiB oshpbf-parser/src/test/resources/org/heigit/bigspatialdata/oshpbf/mapreduce/maldives.osh.pbf
  • 2136edb4ea5b 1,6MiB test-data/equatorial-guinea.osh.pbf
  • 700dd9c55ccc 2,0MiB core/oshdb-tool/src/test/resources/maldives.osh.pbf
  • 92d488930d05 2,0MiB test-data/faroe-islands.osh.pbf
  • 8cbb0ad1ec34 3,0MiB test-data/andorra.osh.pbf
  • c41cd0fa27d3 5,5MiB oshdb-api/src/test/resources/update-test-data.mv.db
  • c997c5c33936 5,8MiB oshdb-api/src/test/resources/test-update-data.mv.db
  • 6683c395170b 6,0MiB oshdb-api/src/test/resources/test-update-data.mv.db
  • 6ec7f46aadf7 8,5MiB oshdb-util/src/main/resources/ne_10m_admin_0_map_units/ne_10m_admin_0_map_units.shp
  • 3c699dd29a85 28MiB test-data/kathmandu.osh.pbf
  • 2f67d705dbe9 78MiB core/oshdb/src/test/resources/data/hosmdb_way.mv.db

I used this command from Stack Overflow [1] to find those files:

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

A good tool for wiping files from the git history could be

  • https://rtyley.github.io/bfg-repo-cleaner/
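As a rough illustration of what a BFG run could involve, here is a minimal dry-run sketch. The `--strip-blobs-bigger-than` option and the reflog/gc follow-up are taken from the BFG documentation; the helper names `run` and `strip_large_blobs`, the `BFG_JAR` path, the `repo.git` directory name and the size threshold are assumptions made for illustration, not something decided in this issue. By default it only prints the commands it would run.

```shell
# Sketch only: BFG_JAR, the mirror directory name and the threshold are assumptions.
DRY_RUN=${DRY_RUN:-1}          # print commands by default; set DRY_RUN=0 to execute them
BFG_JAR=${BFG_JAR:-bfg.jar}    # hypothetical path to the downloaded BFG jar

# Print a command and, unless in dry-run mode, execute it.
run() {
  echo "+ $*"
  [ "$DRY_RUN" = 1 ] || "$@"
}

# $1: repository URL, $2: size threshold as understood by BFG (e.g. 1M)
strip_large_blobs() {
  # BFG is meant to be run on a bare mirror clone, not on a working copy.
  run git clone --mirror "$1" repo.git
  run java -jar "$BFG_JAR" --strip-blobs-bigger-than "$2" repo.git
  # BFG rewrites commits and refs; the old blobs are only dropped
  # once the reflog is expired and the repository is garbage-collected.
  run git -C repo.git reflog expire --expire=now --all
  run git -C repo.git gc --prune=now --aggressive
}
```

Calling `strip_large_blobs https://github.com/GIScience/oshdb 1M` prints the four commands. Nothing is pushed here: publishing the rewritten history would still require a force push and coordination with everyone who has a clone or fork.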

What do you think about this?

[1] https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history

rtroilo avatar Jan 29 '21 15:01 rtroilo

To remove them from master you have to rewrite the whole master history. For a public repo (with releases and forks) this is strongly discouraged, even though I would like to remove them. I'm pretty torn.

joker234 avatar Jan 29 '21 15:01 joker234

I believe the following 3 could be removed without (big) history-rewriting troubles, since they were not (yet) merged into master:

  • c41cd0fa27d3 5,5MiB oshdb-api/src/test/resources/update-test-data.mv.db
  • c997c5c33936 5,8MiB oshdb-api/src/test/resources/test-update-data.mv.db
  • 6683c395170b 6,0MiB oshdb-api/src/test/resources/test-update-data.mv.db
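One way to double-check that these objects are really not part of master is to list everything reachable from a ref and look for the abbreviated ids. This is a minimal sketch; the helper name `blob_reachable_from` is made up for illustration.

```shell
# Hypothetical helper: succeed if an object whose id starts with the given
# prefix is reachable from the given ref.
# $1: ref (e.g. master), $2: abbreviated object id (e.g. c41cd0fa27d3)
# Note: this matches any object id (commit, tree or blob), not only blobs.
blob_reachable_from() {
  git rev-list --objects "$1" | grep "^$2" >/dev/null
}
```

Run inside a clone, `blob_reachable_from master c41cd0fa27d3` returns a non-zero exit status if the object is not reachable from master, in which case removing it would not require rewriting master.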

For the rest… I don't know. The 100MB+ repo size is not great, but rewriting history of the whole project (incl. all branches) is also quite troublesome.

We could just recommend that people create shallow clones when disk usage or slow connections are an issue (e.g. git clone --depth=1 https://github.com/GIScience/oshdb)?

$ git clone --depth=1 https://github.com/GIScience/oshdb
…
$ du -hs oshdb
7.9M	oshdb
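A shallow clone does not have to stay shallow: the remaining history can be fetched later if it is needed. A minimal sketch, where the helper names `shallow_clone` and `deepen_clone` are hypothetical and only wrap standard git options:

```shell
# Hypothetical helper: shallow-clone a repository and print how many
# commits are available locally afterwards.
# $1: repository URL, $2: target directory
shallow_clone() {
  git clone --quiet --depth=1 "$1" "$2" &&
  git -C "$2" rev-list --count HEAD
}

# Hypothetical helper: fetch the rest of the history into an existing
# shallow clone and print the new local commit count.
# $1: directory of the shallow clone
deepen_clone() {
  git -C "$1" fetch --quiet --unshallow &&
  git -C "$1" rev-list --count HEAD
}
```

For example, `shallow_clone https://github.com/GIScience/oshdb oshdb` followed later by `deepen_clone oshdb` would turn the small checkout into a full clone, at which point the large historic files are downloaded after all.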

tyrasd avatar Jan 29 '21 15:01 tyrasd