Support for compressing files in the lfs folder?
We have several .dds/.tga texture files that compress very well. Can we add a field to the pointer file that specifies a compression scheme?
So basically we'd get:
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
scheme zlib
Oh, and sorry... to clarify: if the scheme is set to zlib, or some other compression library, obviously the file in the lfs folder is compressed using that scheme, and probably is stored that way on the server as well. The sha256 is probably that of the compressed data...
We may not even need to add this to the pointer files if all we're doing is compressing them in the .git/lfs/objects directory. As long as we can verify that the inflated object still matches the OID, we should be fine storing the compressed version there, and unzipping it in the git lfs smudge command.
Maybe even store the file with an extension, so the smudge command can tell if the file is compressed or packaged in any way, like:
.git/lfs/objects/ab/cd/some-oid.gz
Storing compressed objects on the server is tricky though. Is zlib fully deterministic? It wouldn't be a good idea to go with a non-deterministic compression algorithm, otherwise slight changes between Git LFS clients or compression libs can cause the pointer file to change without changing the contents of the actual file.
We can also probably add gzip encoding to the API somehow to save on transfer times.
Thanks for filing this, I didn't know these common binary file formats were so compressible.
I see, so maybe the sha256 needs to be calculated pre-compression? I'm not sure if zlib is fully deterministic, although I can't really imagine why not... but from version to version things might change a little, so that would still cause problems. Doing the sha256 pre-compression would bypass that problem though.
That said, a lot of content/media files are in fact just raw data. Some are .dll or .exe files, which typically also compress 30-40%, so there are big savings to be had by compressing the on-disk representation.
I really like the idea of compressing objects in .git/lfs, if they're a format where compression makes sense. Compression for the transfers also makes good sense. For server storage, though, I think it makes more sense to leave that up to the server implementation and keep that out of the api and pointer files.
My idea for adding it to the pointer was that in the future it could add 'other' compression schemes without breaking backwards compatibility... But technoweenie's idea of just adding a .gz extension achieves the same result, and is probably easier to implement.
Compression for the transfers also makes good sense.
This may be tricky for Git LFS implementations that directly use S3. But that's fine. Progressive enhancement is a big goal with the Git LFS API. We may need to figure out a way for the server to tell the client that it's ok to send the content gzipped. Downloads can make use of the Accept-Encoding header.
I'm not sure if zlib is fully deterministic although I can't really imagine why not..
I'm certainly no expert on compression libs, but I think they're mostly non-deterministic, and probably for speed. As long as compatible tools can read/write the content, and the content is preserved, determinism isn't important for a lot of compression uses.
This lzham algorithm has a deterministic mode though:
Supports fully deterministic compression (independent of platform, compiler, or optimization settings), or non-deterministic compression at higher performance.
Another data point is that we get about 80% savings when compressing .pdb files.
It seems like there are 2 ways to go here.
- Keep the pointer file in canonical form and store ONLY the {SHA, size} of the original uncompressed media file in the pointer file.
- Add compression information to the pointer file.
The first way keeps the overall commit SHA normalized and independent of the compression results.
If we then let the pre-push hook do the compression, the upload POST can include the original {SHA, size} plus some scheme info, such as {scheme-name, compressed-SHA, compressed-size}. And then have the server store all 5 fields.
The server would NOT need to understand the compression schemes; but it could "shasum" verify the compressed-SHA/compressed-size of the uploaded data if it wanted to.
The server could then always address the file(s) by the original content SHA. If different clients have different compression schemes, the server could allow each uniquely compressed version to be uploaded. (And only give a 202 if "SHA.scheme-name" or "SHA.compressed-SHA" is not present, for example.)
A subsequent GET API operation on the SHA could return an augmented "_links" section with a "download" section for each compressed variant that the server has. The client would be free to choose which variant to actually download (based upon "scheme-name" or "compressed-size" or whatever).
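To make that concrete, a GET response in that scheme might look something like the following. This is only a sketch: the "download-variants" section, the "scheme-name"/"compressed-oid"/"compressed-size" field names, the example host, and the all-zero compressed OID are invented placeholders, not part of the actual API:

```json
{
  "oid": "4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393",
  "size": 12345,
  "_links": {
    "download": {
      "href": "https://lfs.example.com/objects/4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393"
    },
    "download-variants": [
      {
        "scheme-name": "zlib",
        "compressed-oid": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
        "compressed-size": 4321,
        "href": "https://lfs.example.com/objects/4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393.zlib"
      }
    ]
  }
}
```

A client that doesn't understand variants would just use the plain "download" link, which keeps the scheme backwards compatible.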
This would also let the server, if it wanted to, do a single server-side compression for raw data (filling in the 5 fields for the server-created variant, if you will). This would avoid the need/expense of doing the compression on the fly for every request.
The problem with storing compression info in the pointer is that most compression algorithms are non-deterministic. Slight variations in compression algorithms could change the compression info in the pointer, causing it to change for different users even if the content doesn't change.
I think we can almost support server side compression now. There's no requirement telling how servers should store data, just that they should accept and serve objects according to the SHA-256 signature of the original data. I think it would just have to signal to the client that it accepts compressed contents (similar to Accept-Encoding, but in the _links hypermedia section).
Yeah, I think it's better to not store compression info in the pointer file. Keep them normalized to reflect the original media file.
Should this request be broken into 3 parts?
- file compression in the server-side lfs directory
- compressed data transfer 'on the wire' between client and server
- file compression in the client-side lfs directory
In my view all 3 are useful, but discrete, enhancement requests. Each has benefits and costs. Disk space is (relatively) cheap these days and my personal interest is in (2), as I'm stuck with a relatively slow link to the server and many highly compressible, lfs tracked, files. But I can see value in (1) and (3) too. I'd argue that the OID should be of the uncompressed data - fsck would have to uncompress each file to validate the OID - but the OID would be constant no matter which combination of the above 3 options a given client-server pair implemented.
For the above options:
- the server can use a compression algorithm of its choice.
- client and server must, obviously, both agree on the compression algorithm, and each must be able to convert between the mutually agreed algorithm and whatever compression its local lfs directory uses.
- client can use a compression algorithm of its choice.
where "compression algorithm" may be "no compression" in any of the above. Clearly, using the same compression algorithm at each stage would give performance benefits and keep the implementation simpler.
I don't believe that the above would require a deterministic compression algorithm either - we are always storing the OID of the uncompressed data.
Different users may choose different combinations. e.g. I may choose compression for 1 & 2 but no compression for 3 on my desktop (to give faster checkout say) yet choose compression for 1,2 & 3 on my laptop with less disk space (and checkout will take longer).
+1 Compression of any kind would be nice. We have .sql files of a few hundred MB in our repos; with compression their size could be reduced by about 95%. Is somebody currently working on this feature? Greetings, Jan
For the client/server storage piece, is this something that could be delegated to a filesystem that supports transparent compression? Or is that just passing the buck and would generate other performance issues? ZFS, BTRFS, NTFS??
Maybe it would depend on the filesystem? We could compress the data before sending it over the wire, and store it compressed at the far end, but that makes binary diffing tricky and byte ranges unrealistic.
A subsequent diff is certainly a problem, but sending the data compressed should be a default feature. In my opinion, upload speed vs. data storage is the point. Sending an uncompressed 1 GB file over the wire is a lot slower (depending on the connection) than a 50 MB file. By compressing the files (or a selected group like *.sql), lfs storage would provide us an alternative. Is it possible to pack the files just for the upload and unpack them on the server side? Would that be a compromise?
I think that feature would make so much sense, I'm actually very surprised this isn't in LFS yet.
+1 ;-)
Adding this to the roadmap.
nice ;-)
I agree with an above comment by @stevenyoungs that this issue could do with decomposition for the roadmap.
I am also cautious about implementing LFS-specific compression for both disk storage and over-the-wire transfers if it can be shown that there are reasonable options for having the underlying infrastructure or protocols provide this: filesystems with compression support, and protocols such as HTTP with deflate support.
The link to the roadmap in @technoweenie's comment is dead, and I can't find anything about compression in git-lfs. What is the status of this feature?
I don't think we currently have a plan to implement it. There are extensions which could be used in this case, but it would of course require a deterministic implementation.
I'll reopen this as a way for us to keep track of it.
What do you mean by "There are extensions which could be used in this case..."? Is there something that can be used right now?
Currently I do not care about space savings on either side (as space is cheap), but my biggest pain is transfer time from server to client.
Git LFS has an extension mechanism which allows for users to specify other filter mechanisms on top of the LFS one. I'm not aware, however, of any tooling that performs this extra filtering already, and it would be necessary to have a deterministic implementation so that the blob didn't get rewritten differently every time.
Thanks for the info! (y) Would it be reasonable to split over-the-wire transfer compression (as described by @stevenyoungs) into a separate enhancement issue? The thing is, for me that would be enough, and it wouldn't require a 3rd-party extension (which could be dangerous for the repo). Gains would be:
- it wouldn't require messing around with how lfs handles files
- it would be immune to non-deterministic compression algorithms
- it's already part of HTTP (RFC 2616), so it would simply add an additional HTTP capability to lfs file transfers
As I see it, this could be implemented by simply specifying which file extensions can be compressed.
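For illustration only, such a setting could look something like this in gitconfig style; the section and key names below are invented for the sketch, not real Git LFS configuration:

```ini
[lfs "transfer"]
	# Hypothetical: only objects with these extensions get gzipped on the wire.
	compressExtensions = .sql,.dds,.tga,.pdb
```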
Feel free to open a new issue for the transport level compression.
Thanks! Created issue #3683
Not having built-in support for compression in Git LFS is a real drawback IMO, considering many large files do compress very well, e.g. libraries, game assets (like geometry), databases, etc. When you have a few GB of LFS files, it makes a real difference.
I spent quite a bit of time experimenting with Git LFS Extensions to add compression:
[lfs "extension.gzip"]
clean = gzip -n --fast
smudge = gunzip
priority = 0
That sounds like a simple solution that ought to work out of the box on Linux, macOS and other *nix - for Windows, people can always use Windows Subsystem for Linux. Seems like only gzip is sufficiently popular to be installed by default on all these platforms.
There are likely 1000+ compressed LFS objects in my test repo. I just discovered that for 3 of them, the Git LFS pointer differs when generated on macOS vs Linux. It's because the underlying call to gzip doesn't return the same output. It works for 99.9% of the files except these particular 3!
So using gzip goes out the window and I can't think of any alternative that works out of the box.
In any case, after a few days of testing, using a Git LFS extension is also impractical because users have to pay attention to clone with GIT_LFS_SKIP_SMUDGE=1, then edit the .git/config, and finally check out master; otherwise everything fails with obscure errors.
TLDR:
- Compression is absolutely valuable for HTTP transport and also, though less important, for on-disk storage (there's "git lfs dedup", which is even better for on-disk)
- In theory this can be implemented as an addition to Git LFS, but in practice it's not usable this way
- Compression must be built into Git LFS, and the simplest solution would be to add it when "cleaning" and "smudging" (turned on with an extra Git attribute?), as it magically applies to on-disk and HTTP transport without having to update protocols and servers
Would I be wrong in interpreting the above comment as: if we had a platform-agnostic, deterministic compression tool that we could ship alongside the installation of the lfs client binary, this feature would be trivial to support? (E.g. a binary named "lfs-gzip" based on a popular cross-platform implementation?)
That said there might be dangers in using gzip, as it isn't guaranteed to be deterministic in compression, only in decompression? https://unix.stackexchange.com/a/570554
I'm leaning towards using a different algorithm, one that would compress but is also deterministic somehow. But having just done some quick Google searches, I'm not seeing any popular algorithms or implementations specifically designed to be deterministic.
We could probably build or choose our own implementation and if cross-platform enough, call it deterministic as it's the only implementation in use. Pako is a JS re-implementation that would be cross-platform, Go would probably also have an implementation with very few platform dependencies to get in the way. Or we could always pick a particular gzip implementation and ship cross-platform builds from it, that way the likelihood of getting a different result on a different platform is greatly reduced.
I'm not sure there's a way to provide a foolproof guarantee that the same inputs produce the same results unless we manually review the compression algorithm and its implementation for likely non-deterministic behaviours though.
I'll ignore for a moment that all programs execute in a non-deterministic fashion due to the many possible errors and variances that could occur, as that isn't really practical to consider here. After all, if the result differs one time in a billion, all that happens is that additional data is stored... right? Unless the gzip process corrupts the data in storage, but we could maybe optionally add a validation step after compression. That might be a good option to provide, if it's not already built into a particular gzip binary.