git-sizer
Does git-sizer count objects managed by Git LFS?
I have a largish bare repo with Git LFS installed (SVN to Git migration):
proj.git (BARE:master) $ git-sizer
Processing blobs: 1107392
Processing trees: 178226
Processing commits: 29412
Matching commits to trees: 29412
Processing annotated tags: 0
Processing references: 24
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Blobs | | |
| * Total size | 12.8 GiB | * |
| | | |
| Biggest objects | | |
| * Trees | | |
| * Maximum entries [1] | 1.96 k | * |
| * Blobs | | |
| * Maximum size [2] | 113 MiB | *********** |
| | | |
| Biggest checkouts | | |
| * Number of directories [3] | 13.3 k | ****** |
| * Maximum path depth [4] | 18 | * |
| * Maximum path length [5] | 232 B | ** |
| * Number of files [6] | 910 k | ****************** |
| * Total size of files [7] | 3.37 GiB | *** |
I've written a little git lfs ls-files helper, git_lfs_calculate_size_by_type.py, which reports this for the proj.git repo:
Git LFS objects summary:
.lib: count: 1111 size: 8764.66 MB
.dll: count: 749 size: 1427.98 MB
.pdb: count: 612 size: 2814.09 MB
.exe: count: 786 size: 2005.72 MB
.zip: count: 24 size: 1153.65 MB
Total: count: 3282 size: 16166.11 MB
Does the latter 16166.11 MB relate to the former 12.8 GiB in any way?
Or is the grand total of the repo, Git and Git LFS objects, the sum of the two figures?
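As a quick sanity check on the units, here is a small sketch that converts the helper's total to GiB. It is not known whether the helper script reports MB as 10**6 or 2**20 bytes, so both are tried; `to_gib` is a hypothetical name used only for illustration.

```python
GIB = 2**30


def to_gib(megabytes: float, bytes_per_mb: int) -> float:
    """Convert a figure reported in 'MB' to GiB for a given MB definition."""
    return megabytes * bytes_per_mb / GIB


lfs_total_mb = 16166.11  # total reported by the helper script
git_blobs_gib = 12.8     # "Blobs: Total size" reported by git-sizer

# Assumption: the helper's "MB" is one of these two common definitions.
for label, bytes_per_mb in (("MB = 10**6 bytes", 10**6), ("MB = 2**20 bytes", 2**20)):
    lfs_gib = to_gib(lfs_total_mb, bytes_per_mb)
    print(f"{label}: {lfs_gib:.2f} GiB of LFS content vs {git_blobs_gib} GiB of Git blobs")
```

Under either definition the LFS total is larger than the 12.8 GiB blob total, so the LFS figure cannot simply be a subset of what git-sizer reported.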
Git Sizer does not do this, but I think that it would be neat if it did. @mhagger: do you agree?
@ttaylorr Thanks for answering my question. It would be neat if it did, indeed.
I agree that this would be neat, with one proviso: either we should prove using benchmarks that this feature is not too costly, or we should make it possible to turn it on/off via command-line options. (Currently, git-sizer never has to open up any blob files, but if this feature were implemented, as I understand it, it would have to open and parse any blob files smaller than some limit, correct?)
I'm new to Git LFS, but AFAIU, it would have to open each pointer file and parse the size key. An option sounds perfect.
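For illustration, a minimal sketch of what "parse the size key" could look like, assuming the v1 pointer layout (a version line followed by `key value` lines). `lfs_pointer_size` is a hypothetical helper, not part of git-sizer or Git LFS, and it only recognises the `https://git-lfs.github.com/spec/v1` version line.

```python
from typing import Optional

# Assumption: only the v1 spec URL is recognised; other version lines are ignored.
LFS_VERSION_LINE = b"version https://git-lfs.github.com/spec/v1"


def lfs_pointer_size(blob: bytes) -> Optional[int]:
    """Return the size recorded in an LFS pointer, or None if blob is not one."""
    lines = blob.split(b"\n")
    if lines[0] != LFS_VERSION_LINE:
        return None
    for line in lines[1:]:
        # Each remaining line is "key value"; we only care about "size".
        key, _, value = line.partition(b" ")
        if key == b"size" and value.isdigit():
            return int(value)
    return None
```

Anything that does not start with the version line is rejected after looking at the first line only, so ordinary blobs are cheap to rule out once their contents are in hand.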
> (Currently, git-sizer never has to open up any blob files, but if this feature were implemented, as I understand it, it would have to open and parse any blob files smaller than some limit, correct?)
Right, we'll have to inflate blobs, but I don't think that we have to do so based on size, if I'm understanding correctly. Git LFS only watches files which match patterns given in any .gitattributes in a parent directory, so at worst we'll have to open up a blob, but at best we'll only match its path in the tree.
Some code that already exists to that end:
- https://github.com/git-lfs/git-lfs/blob/v2.6.1/lfs/pointer.go#L100-L103, which will decode a pointer given an io.Reader (there's an additional variant that will stop after realizing that the data included can't possibly contain a pointer, and will return an io.Reader containing the rest of the contents).
- https://github.com/git-lfs/wildmatch/blob/v1.0.0/wildmatch.go#L172-L180, which will return a boolean indicating whether or not a given path was matched by a pattern (using the same semantics as Git uses to match entries in .gitattributes).
- https://github.com/git-lfs/git-lfs/blob/v2.6.1/git/gitattr/attr.go, which will return a set of wildmatch patterns given a .gitattributes (or parse a whole slew of them, if you're in a directory with >= 1 parents). Also supports macros, etc.
The entire point of git lfs is to be able to not care about the size of LFS files except at HEAD.
- Given that you've already got the list of blobs and their sizes from previous steps, you can make a guess at which ones might be lfs without touching the filesystem any further, because their blob sizes will all be roughly in the range of 100-200 bytes.
- You can pretty quickly determine which of those are actually stored in lfs using git check-attr filter on them (or as mentioned above just parse the .gitattributes files yourself in go, without the subprocess, though you probably want git check-attr --cached, which is likely faster in many cases since it uses the index).
- git lfs ls-files -s -I is probably the most "correct" way to get the size for those objects, though if you're going by path then you have to go one invocation per path. Except, git lfs was written in go, so you can just use github.com/git-lfs/git-lfs/lfs.GitScanner with appropriate filters so it doesn't waste time searching paths you already know don't have anything interesting.
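A rough sketch of the git check-attr step, shelling out rather than parsing .gitattributes directly. `parse_check_attr` and `lfs_paths` are hypothetical names; the parser assumes the default `path: attribute: value` output and does not handle paths that Git quotes because of special characters.

```python
import subprocess
from typing import Dict, Iterable, List


def parse_check_attr(output: str) -> Dict[str, str]:
    """Map each path to its filter value from `git check-attr filter` output."""
    result: Dict[str, str] = {}
    for line in output.splitlines():
        # Split from the right so a path that itself contains ": " stays intact.
        parts = line.rsplit(": ", 2)
        if len(parts) == 3 and parts[1] == "filter":
            result[parts[0]] = parts[2]
    return result


def lfs_paths(paths: Iterable[str], repo: str = ".") -> List[str]:
    """Return the given paths whose filter attribute is 'lfs' (requires git)."""
    proc = subprocess.run(
        ["git", "-C", repo, "check-attr", "--cached", "filter", "--", *paths],
        check=True, capture_output=True, text=True,
    )
    return [p for p, value in parse_check_attr(proc.stdout).items() if value == "lfs"]
```

Passing many paths in one invocation avoids one subprocess per path; whether that is fast enough on a large repository would still need measuring.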
git-sizer usually doesn't know the path associated with a particular blob, and indeed that's a feature, not a bug. Why? In certain types of pathological repositories like git bombs, the same blob is repeated over and over at an astronomical number of different paths. git-sizer goes to great lengths to be immune to pathological Git repositories, always scaling like the number of objects in the Git object database rather than like the actual size of checked-out trees or whatever. So I'd be reluctant to add any features that require paths for blobs. If we start doing gitattribute checks, then I think we'd need those paths.
One could imagine skipping those gitattribute checks and instead deciding which items are LFS pointer files based only on their lengths and contents. This would probably be a nearly perfect approximation, and I think that it could be implemented in a way that isn't extravagantly expensive. (We'd still want it to be optional, though.)
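The length-based part of that idea could be sketched as follows, assuming the object list comes from something like `git cat-file --batch-check --batch-all-objects`, whose default output is one `<oid> <type> <size>` line per object. `pointer_candidates` and the 1024-byte cutoff are illustrative assumptions, not git-sizer behaviour.

```python
from typing import Iterable, List

# Assumption: LFS pointers are tiny, so anything at or above this size is skipped.
MAX_POINTER_SIZE = 1024


def pointer_candidates(batch_check_lines: Iterable[str],
                       max_size: int = MAX_POINTER_SIZE) -> List[str]:
    """Return OIDs of blobs small enough to possibly be LFS pointer files."""
    candidates = []
    for line in batch_check_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything not shaped like "<oid> <type> <size>"
        oid, obj_type, size = parts
        if obj_type == "blob" and int(size) < max_size:
            candidates.append(oid)
    return candidates
```

Each object is listed once regardless of how many paths point at it, so this stays proportional to the number of objects; only the surviving candidates would then need their contents inspected.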
If you want an exact count including gitattribute checks, then you're probably better off asking the LFS project for such a tool, if they don't already have one.
git lfs ls-files -s will tell you the sizes of all the lfs files. But I'm not even sure what git-sizer would do with it; storing large files in git is a bad idea, so git warns you about it, but storing them in LFS is precisely what LFS is for, so what would you warn about?
I feel like anything that actually inspects the content of files is going to be prohibitively slow. Probably the most useful thing it could do would actually be to just get the total number/size of objects in .git/lfs/objects. That would be relatively quick, I think, though the results would be highly dependent on which commits have ever been checked out. The second-most-useful thing I can think of would be to flag any very small objects in there, e.g. <100 bytes, at which point the stub file in the git index actually takes up more space than it would have taken to just put the file in git without the lfs, making the use of LFS counterproductive (I've definitely seen people using sloppy glob patterns to put even empty files into LFS).
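Something along these lines would cover both ideas, as a sketch only: it walks a local LFS object store, totals what it finds, and lists objects under a size threshold. `summarize_lfs_store` is a hypothetical name, and the code simply counts every regular file under the directory it is given (for a bare repo that would presumably be proj.git/lfs/objects).

```python
import os
from typing import List, Tuple


def summarize_lfs_store(objects_dir: str,
                        small_threshold: int = 100) -> Tuple[int, int, List[str]]:
    """Return (object count, total bytes, paths of objects below the threshold)."""
    count = 0
    total = 0
    small: List[str] = []
    for root, _dirs, files in os.walk(objects_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            count += 1
            total += size
            if size < small_threshold:
                # Smaller than a typical pointer file, so LFS gains nothing here.
                small.append(path)
    return count, total, small
```

This only stats files and never reads them, so it should be cheap, but it reflects only the objects that happen to have been downloaded into the local store.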