
Support for compressing files in the lfs folder?

Open tvandijck opened this issue 10 years ago • 44 comments

We have several .dds/.tga texture files that compress very well. Can we add a field to the pointer file that specifies a compression scheme?

So basically we'd get:

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
scheme: zlib

tvandijck avatar Apr 23 '15 21:04 tvandijck

Oh, and sorry, to clarify: if the scheme is set to zlib, or some other compression library, obviously the file in the lfs folder is compressed using that scheme, and probably is that way on the server as well. The sha256 is probably that of the compressed data...

tvandijck avatar Apr 23 '15 21:04 tvandijck

We may not even need to add this to the pointer files if all we're doing is compressing them in the .git/lfs/objects directory. As long as we can verify that the inflated object still matches the OID, we should be fine storing the compressed version there, and unzipping it in the git lfs smudge command.

Maybe even store the file with an extension, so the smudge command can tell if the file is compressed or packaged in any way, like:

.git/lfs/objects/ab/cd/some-oid.gz
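A minimal sketch (in Python, purely illustrative; Git LFS itself is written in Go) of what that smudge-time check could look like: inflate if the on-disk object carries a `.gz` extension, then verify the inflated bytes against the pointer's OID. The function name and layout here are assumptions, not actual LFS internals:

```python
import gzip
import hashlib
from pathlib import Path

def smudge_object(path: Path, expected_oid: str) -> bytes:
    """Load an LFS object, inflating it first if it was stored compressed."""
    data = path.read_bytes()
    # A .gz suffix marks objects that were compressed on disk.
    if path.suffix == ".gz":
        data = gzip.decompress(data)
    # Verify that the *inflated* content still matches the pointer's OID.
    if hashlib.sha256(data).hexdigest() != expected_oid:
        raise ValueError(f"object {path} does not match OID {expected_oid}")
    return data
```

The pointer stays unchanged: only the local object store knows whether a given object happens to be stored deflated.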

technoweenie avatar Apr 23 '15 22:04 technoweenie

Storing compressed objects on the server is tricky though. Is zlib fully deterministic? It wouldn't be a good idea to go with a non-deterministic compression algorithm, otherwise slight changes between Git LFS clients or compression libs can cause the pointer file to change without changing the contents of the actual file.

We can also probably add gzip encoding to the API somehow to save on transfer times.

Thanks for filing this, I didn't know about these common binary files that are very compressible.
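As a concrete illustration of why hashing the compressed stream would be fragile: the gzip format embeds a modification time in its header, so byte-identical input can produce different compressed bytes from one run to the next. A small Python sketch (the mtime values are arbitrary, standing in for "compressed at different times"):

```python
import gzip
import hashlib

data = b"some highly compressible texture data" * 1000

# gzip embeds a modification time in its header; two compressions of
# identical input at different times can therefore differ byte-for-byte.
a = gzip.compress(data, mtime=1)
b = gzip.compress(data, mtime=2)

assert a != b                                            # different bytes...
assert gzip.decompress(a) == gzip.decompress(b) == data  # ...same content

# So a pointer that hashed the *compressed* stream would change even
# though the underlying file did not:
assert hashlib.sha256(a).hexdigest() != hashlib.sha256(b).hexdigest()
```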

technoweenie avatar Apr 23 '15 22:04 technoweenie

I see, so maybe the sha256 needs to be calculated pre-compression? I'm not sure if zlib is fully deterministic, although I can't really imagine why not... but from version to version things might change a little, so that would still cause problems. Doing the sha256 pre-compression would bypass that problem, though.

That said, a lot of content/media files are in fact just raw data. Some are .dll or .exe files, which typically also compress 30-40%, so there are big savings to be had by compressing the on-disk representation.

tvandijck avatar Apr 24 '15 14:04 tvandijck

I really like the idea of compressing objects in .git/lfs, if they're a format where compression makes sense. Compression for the transfers also makes good sense. For server storage, though, I think it makes more sense to leave that up to the server implementation and keep that out of the api and pointer files.

rubyist avatar Apr 24 '15 14:04 rubyist

My idea in adding it to the pointer was that in the future other compression schemes could be added without breaking backwards compatibility... But technoweenie's idea of just adding a .gz extension achieves the same result, and is probably easier to implement.

tvandijck avatar Apr 24 '15 14:04 tvandijck

Compression for the transfers also makes good sense.

This may be tricky for Git LFS implementations that directly use S3. But that's fine. Progressive enhancement is a big goal with the Git LFS API. We may need to figure out a way for the server to tell the client that it's ok to send the content gzipped. Downloads can make use of the Accept-Encoding header.
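A hedged sketch of what the client side of that negotiation could look like, using the standard HTTP headers; the `server_accepts_gzip` flag stands in for whatever hypothetical signal the API would use to say "gzipped uploads are OK":

```python
import gzip

def prepare_upload(content: bytes, server_accepts_gzip: bool):
    """Build (headers, body) for an LFS object upload, gzipping the
    body only when the server has said it can handle it."""
    if server_accepts_gzip:
        body = gzip.compress(content)
        headers = {"Content-Encoding": "gzip",
                   "Content-Length": str(len(body))}
    else:
        body = content
        headers = {"Content-Length": str(len(body))}
    return headers, body
```

Servers that proxy straight to blob stores like S3 could simply never advertise the capability, and clients would fall back to raw uploads, which is exactly the progressive-enhancement behavior described above.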

I'm not sure if zlib is fully deterministic although I can't really imagine why not..

I'm certainly no expert on compression libs, but I think they're mostly non-deterministic, and probably for speed. As long as compatible tools can read/write the content, and the content is preserved, determinism isn't important for a lot of compression uses.

This lzham algorithm has a deterministic mode though:

Supports fully deterministic compression (independent of platform, compiler, or optimization settings), or non-deterministic compression at higher performance.

technoweenie avatar Apr 24 '15 15:04 technoweenie

Another data point is that we get about 80% savings when compressing .pdb files.

jeffhostetler avatar Apr 29 '15 14:04 jeffhostetler

It seems like there are 2 ways to go here.

  1. Keep the pointer file in canonical form and store ONLY the {SHA, size} of the original uncompressed media file in the pointer file.
  2. Add compression information to the pointer file.

The first way keeps the overall commit SHA normalized and independent of the compression results.

If we then let the pre-push hook do the compression, the upload POST can include the original {SHA, size} plus some scheme info, such as {scheme-name, compressed-SHA, compressed-size}. And then have the server store all 5 fields.

The server would NOT need to understand the compression schemes; but it could "shasum" verify the compressed-SHA/compressed-size of the uploaded data if it wanted to.

The server could then always address the file(s) by the original content SHA. If different clients have different compression schemes, the server could allow each uniquely compressed version to be uploaded. (And only give a 202 if "SHA.scheme-name" or "SHA.compressed-SHA" is not present, for example.)

A subsequent GET API operation on the SHA could return an augmented "_links" section with a "download" section for each compressed variant that the server has. The client would be free to choose which variant to actually download (based upon "scheme-name" or "compressed-size" or whatever).

This would also let the server, if it wanted to, do a single server-side compression for raw data (filling in the 5 fields for the server-created variant, if you will). This would avoid the need/expense of doing the compression on the fly for every request.
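To make the proposal concrete, here is a hypothetical sketch (in Python, with entirely illustrative field names, hashes, and URLs; none of this is the real LFS API) of the 5-field upload metadata and the augmented "_links" response:

```python
# Hypothetical upload metadata: the canonical {SHA, size} of the original
# file plus the compression variant actually being uploaded.
upload_request = {
    "oid": "4d7a2146...",             # SHA-256 of the *uncompressed* file
    "size": 12345,                    # uncompressed size
    "scheme-name": "zlib",
    "compressed-oid": "9f86d081...",  # SHA-256 of the compressed stream
    "compressed-size": 4096,
}

# A GET on the canonical OID could then list each variant the server has,
# and the client picks whichever it can decode:
object_response = {
    "oid": "4d7a2146...",
    "size": 12345,
    "_links": {
        "download": [
            {"scheme-name": "zlib", "compressed-size": 4096,
             "href": "https://example.com/objects/4d7a2146/zlib"},
            {"scheme-name": "none", "compressed-size": 12345,
             "href": "https://example.com/objects/4d7a2146/raw"},
        ],
    },
}
```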

jeffhostetler avatar Apr 29 '15 15:04 jeffhostetler

The problem with storing compression info in the pointer is that most compression algorithms are non-deterministic. Slight variations in compression algorithms could change the compression info in the pointer, causing it to change for different users even if the content doesn't change.

I think we can almost support server side compression now. There's no requirement telling how servers should store data, just that they should accept and serve objects according to the SHA-256 signature of the original data. I think it would just have to signal to the client that it accepts compressed contents (similar to Accept-Encoding, but in the _links hypermedia section).

technoweenie avatar Apr 29 '15 15:04 technoweenie

Yeah, I think it's better to not store compression info in the pointer file. Keep them normalized to reflect the original media file.

jeffhostetler avatar Apr 30 '15 01:04 jeffhostetler

Should this request be broken into 3 parts?

  1. file compression in the server side lfs directory
  2. compressed data transfer 'on the wire' between client and server
  3. file compression in the client side lfs directory

In my view all 3 are useful, but discrete, enhancement requests. Each has benefits and costs. Disk space is (relatively) cheap these days, and my personal interest is in (2), as I'm stuck with a relatively slow link to the server and many highly compressible, lfs-tracked files. But I can see value in (1) and (3) too. I'd argue that the OID should be of the uncompressed data - fsck would have to uncompress each file to validate the OID - but the OID would be constant no matter which combination of the above 3 options a given client-server pair implemented.

For the above options:

  1. the server can use a compression algorithm of its choice.
  2. client and server must, obviously, both agree on the compression algorithm. Both must be able to convert from the mutually agreed compression algorithm to/from their local lfs directory compression algorithm.
  3. client can use a compression algorithm of its choice.

where "compression algorithm" may be no compression in any of the above. Clearly, using the same compression algorithm at each stage would give performance benefits and keep the implementation simpler.

I don't believe that the above would require a deterministic compression algorithm either - we are always storing the OID of the uncompressed data.

Different users may choose different combinations. e.g. I may choose compression for 1 & 2 but no compression for 3 on my desktop (to give faster checkout say) yet choose compression for 1,2 & 3 on my laptop with less disk space (and checkout will take longer).
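The key invariant above, that the OID stays constant no matter what each side stores on disk, can be sketched like this (Python, illustrative only; the format sniffing is deliberately crude):

```python
import gzip
import hashlib
import zlib

data = b"raw media bytes" * 500
oid = hashlib.sha256(data).hexdigest()

# Whatever each side stores on disk, the OID is always computed over the
# uncompressed content, so it is identical for every storage choice.
stored_variants = [
    data,                 # no compression
    gzip.compress(data),  # one side chose gzip
    zlib.compress(data),  # the other side chose zlib
]

def recover(blob: bytes) -> bytes:
    # Crude sniffing for this sketch: gzip magic bytes, then try zlib,
    # else assume the blob is raw.
    if blob[:2] == b"\x1f\x8b":
        return gzip.decompress(blob)
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return blob

for blob in stored_variants:
    assert hashlib.sha256(recover(blob)).hexdigest() == oid
```

This is also why fsck would have to inflate each object before validating: the on-disk bytes differ per storage choice, but the OID never does.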

stevenyoungs avatar Jan 29 '16 10:01 stevenyoungs

+1 A compression of any kind would be nice. We have .sql files of a few hundred MB in our repos. With compression, the size could be reduced by about 95%. Is somebody currently working on this feature? Greetings, Jan

jg-development avatar Jun 05 '16 22:06 jg-development

For the client/server storage piece, is this something that could be delegated to a filesystem that supports transparent compression? Or is that just passing the buck and would generate other performance issues? ZFS, BTRFS, NTFS??

javabrett avatar Jun 05 '16 22:06 javabrett

Maybe it would depend on the filesystem? We could compress the data before sending it over the wire and store it compressed at the other end, but that makes binary diffing tricky and byte ranges unrealistic.

ttaylorr avatar Jun 05 '16 23:06 ttaylorr

Diffing afterwards is certainly a problem, but sending the data compressed should be a default feature. In my opinion, upload speed vs. data storage is the point. Sending an uncompressed 1 GB file over the wire is a lot slower (depending on the connection) than a 50 MB file. By compressing the files (or a selected group like *.sql), the LFS storage would provide us an alternative. Is it possible to pack the files just for the upload and unpack them on the server side? Would that be a compromise?

jg-development avatar Jun 11 '16 10:06 jg-development

I think that feature would make so much sense, I'm actually very surprised this isn't in LFS yet.

reinsch82 avatar Jul 21 '16 09:07 reinsch82

+1 ;-)

jg-development avatar Jul 21 '16 20:07 jg-development

Adding this to the roadmap.

technoweenie avatar Aug 11 '16 22:08 technoweenie

nice ;-)

jg-development avatar Aug 11 '16 23:08 jg-development

I agree with an above comment by @stevenyoungs that this issue could do with decomposition for the roadmap.

I am also cautious about implementing LFS-specific compression for both disk storage and over-the-wire transfers if it can be shown that there are reasonable options for having underlying infrastructure or protocols provide it: filesystems with compression support, and protocols with deflate support, such as HTTP.

javabrett avatar Aug 25 '16 00:08 javabrett

The link to the roadmap in @technoweenie's comment is dead, and I can't find anything about compression in git-lfs. What is the status of this feature?

glehmann avatar May 15 '19 14:05 glehmann

I don't think we currently have a plan to implement it. There are extensions which could be used in this case, but it would of course require a deterministic implementation.

I'll reopen this as a way for us to keep track of it.

bk2204 avatar May 15 '19 15:05 bk2204

I don't think we currently have a plan to implement it. There are extensions which could be used in this case, but it would of course require a deterministic implementation.

I'll reopen this as a way for us to keep track of it.

What do you mean by "There are extensions which could be used in this case..."? Is there something which can be used right now?

Currently I do not care about space savings on either side (as space is cheap), but my biggest pain is the transfer time from server to client.

Rublis avatar Jun 12 '19 10:06 Rublis

What do you mean by "There are extensions which could be used in this case..."? Is there something which can be used right now?

Currently I do not care about space savings on either side (as space is cheap), but my biggest pain is the transfer time from server to client.

Git LFS has an extension mechanism which allows for users to specify other filter mechanisms on top of the LFS one. I'm not aware, however, of any tooling that performs this extra filtering already, and it would be necessary to have a deterministic implementation so that the blob didn't get rewritten differently every time.

bk2204 avatar Jun 12 '19 13:06 bk2204

What do you mean by "There are extensions which could be used in this case..."? Is there something which can be used right now? Currently I do not care about space savings on either side (as space is cheap), but my biggest pain is the transfer time from server to client.

Git LFS has an extension mechanism which allows for users to specify other filter mechanisms on top of the LFS one. I'm not aware, however, of any tooling that performs this extra filtering already, and it would be necessary to have a deterministic implementation so that the blob didn't get rewritten differently every time.

Thanks for the info! (y) Would it be reasonable to split over-the-wire transfer compression (as described by @stevenyoungs) into a separate enhancement issue? The thing is that for me that would be enough, and it wouldn't require a 3rd-party extension (which could be dangerous for the repo). Gains would be:

  • it wouldn't require messing around with how LFS handles files
  • it would be immune to non-deterministic compression algorithms
  • it's already part of HTTP (RFC 2616), so it would simply add an additional HTTP capability to LFS file transfers

The way I see this could be implemented is that we simply specify which file extensions can be compressed.

Rublis avatar Jun 12 '19 14:06 Rublis

Feel free to open a new issue for the transport level compression.

bk2204 avatar Jun 12 '19 14:06 bk2204

Feel free to open a new issue for the transport level compression.

Thanks! Created issue #3683

Rublis avatar Jun 12 '19 17:06 Rublis

Not having built-in support for compression in Git LFS is a real drawback IMO, considering many large files do compress very well, e.g. libraries, game assets (like geometry), databases, etc. When you have a few GB of LFS files, it makes a real difference.

I spent quite a bit of time experimenting with Git LFS Extensions to add compression:

[lfs "extension.gzip"]
	clean = gzip -n --fast
	smudge = gunzip
	priority = 0

That sounds like a simple solution that ought to work out of the box on Linux, macOS and other *nix - for Windows, people can always use Windows Subsystem for Linux. Seems like only gzip is sufficiently popular to be installed by default on all these platforms.

There are likely 1000+ compressed LFS objects in my test repo. I just discovered that for 3 of them, the Git LFS pointer differs when generated on macOS vs Linux. It's because the underlying call to gzip doesn't return the same output. It works for 99.9% of the files except these particular 3!

So using gzip goes out the window and I can't think of any alternative that works out of the box.

In any case, after a few days of testing, using a Git LFS extension is also impractical because users have to pay attention to clone with GIT_LFS_SKIP_SMUDGE=1, then edit the .git/config, and finally check out master, otherwise everything fails with obscure errors.

TLDR:

  • Compression is absolutely valuable for HTTP transport and also, though not as important, for on-disk storage (there's `git lfs dedup`, which is even better for on-disk)
  • In theory this can be implemented as an addition to Git LFS but in practice, it's not usable this way
  • Compression must be built into Git LFS, and the simplest solution would be to add it when "cleaning" and "smudging" (turned on with an extra Git attribute?), as it magically applies to on-disk storage and HTTP transport without having to update protocols and servers

swisspol avatar Mar 01 '20 01:03 swisspol

Would I be wrong in interpreting the above comment as: if we had a platform-agnostic, deterministic compression tool that we could ship alongside the installation of the lfs client binary, this feature would be trivial to support? (E.g. a binary named "lfs-gzip" based on a popular cross-platform implementation?)

That said, there might be dangers in using gzip, as it isn't guaranteed to be deterministic in compression, only in decompression: https://unix.stackexchange.com/a/570554

I'm leaning towards using a different algorithm, one that would compress but is also deterministic somehow. But having just done some quick Google searches, I'm not seeing any popular algorithms or implementations specifically designed to be deterministic.

We could probably build or choose our own implementation and if cross-platform enough, call it deterministic as it's the only implementation in use. Pako is a JS re-implementation that would be cross-platform, Go would probably also have an implementation with very few platform dependencies to get in the way. Or we could always pick a particular gzip implementation and ship cross-platform builds from it, that way the likelihood of getting a different result on a different platform is greatly reduced.
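For what it's worth, pinning a single implementation and zeroing the header timestamp does give byte-identical output across runs of that implementation; CPython's gzip module, for example, exposes this via its `mtime` parameter. A sketch (this only guarantees determinism within one library build, not across different gzip implementations):

```python
import gzip
import hashlib

def deterministic_gzip(data: bytes) -> bytes:
    """Compress with a fixed header timestamp so repeated runs of the
    same library produce byte-identical output."""
    return gzip.compress(data, compresslevel=9, mtime=0)

data = b"game asset geometry" * 2048
a = deterministic_gzip(data)
b = deterministic_gzip(data)

assert a == b                                  # stable across runs...
assert hashlib.sha256(a).digest() == hashlib.sha256(b).digest()
assert gzip.decompress(a) == data              # ...and lossless
```

This is essentially what `gzip -n` tries to do at the command line, except shipping one pinned implementation with the client removes the cross-platform tool variance described above.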

I'm not sure there's a way to provide a foolproof guarantee that the same inputs produce the same results unless we manually review the compression algorithm and its implementation for likely non-deterministic behaviours though.

I'll ignore for a moment that all programs execute in a non-deterministic fashion due to the many possible errors and variances that could occur, as that isn't really practical to consider here. After all, if the result differs one time in a billion, all that happens is additional data is stored... right? Unless the gzip process corrupts the data in storage, but we could maybe optionally add a validation step after compression. That might be a good option to provide, if not already built in to a particular gzip binary.

LouisStAmour avatar Mar 16 '22 04:03 LouisStAmour