Support optional external compressor for bottles (to support zopfli, pigz, etc.)

Open johnsonjh opened this issue 1 year ago • 14 comments

Provide a detailed description of the proposed feature

Support calling an external compressor for bottles, such as zopfli or pigz.

What is the motivation for the feature?

Would result in size savings of 5-15% for bottles. On the receiving side, decompression of Zopfli-created gzip streams is of equal speed and often 1-2% faster with no client changes required, so there is no downside, beyond slower (one-time) compression of the bottles.

How will the feature be relevant to at least 90% of Homebrew users?

100% of users would benefit from faster downloads due to smaller packages. A smaller percentage of users (who self-host bottles) would benefit additionally from storage space and bandwidth savings.

What alternatives to the feature have been considered?

Currently, bottles can be manually decompressed and recompressed using zopfli or pigz -11, with a new sha256sum calculated afterwards. This works fine but is a time-consuming manual process.
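A minimal Ruby sketch of that manual recompress-and-rehash step. Ruby's standard library has no zopfli binding, so `Zlib::BEST_COMPRESSION` stands in for the stronger compressor here; `recompress_gzip` and the sample data are hypothetical names used only for illustration.

```ruby
require "digest"
require "stringio"
require "zlib"

# Decompress a gzip stream and compress it again at another level.
# The header timestamp and original name are carried over from the input.
# (How an mtime of 0 is treated may depend on the Ruby version; not covered here.)
def recompress_gzip(gz_bytes, level: Zlib::BEST_COMPRESSION)
  reader = Zlib::GzipReader.new(StringIO.new(gz_bytes))
  mtime = reader.mtime.to_i
  orig_name = reader.orig_name
  payload = reader.read
  reader.close

  out = StringIO.new(String.new)
  writer = Zlib::GzipWriter.new(out, level)
  writer.mtime = mtime
  writer.orig_name = orig_name if orig_name
  writer.write(payload)
  writer.finish # finishes the gzip stream without closing `out`
  out.string
end

# Demo on an in-memory stand-in for a bottle (a real bottle would be read from disk).
sample = StringIO.new(String.new)
gz = Zlib::GzipWriter.new(sample, Zlib::BEST_SPEED)
gz.mtime = 1_659_146_820 # arbitrary fixed timestamp for the demo
gz.orig_name = "example.bottle.tar"
gz.write("example payload\n" * 10_000)
gz.finish

repacked = recompress_gzip(sample.string)
puts "old: #{sample.string.bytesize} bytes, sha256 #{Digest::SHA256.hexdigest(sample.string)}"
puts "new: #{repacked.bytesize} bytes, sha256 #{Digest::SHA256.hexdigest(repacked)}"
```

The new checksum differs whenever the compressed bytes differ, which is why the bottle's sha256 has to be recalculated after recompression.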

johnsonjh, Jul 30 '22 02:07

I have been contemplating using Zstd to compress our bottles instead. Zstd is nice because it's fast and has a better compression ratio than gzip. However, this would require the user to have zstd installed to pour bottles, and it clearly cannot be used to compress the bottles of zstd and its dependencies.

Using zopfli is slow, but it also uses the same format as gzip, so users won't need additional software to pour bottles.

carlocab, Jul 31 '22 09:07

> However, this would require the user to have zstd installed to pour bottles, and it clearly cannot be used to compress the bottles of zstd and its dependencies.

Yeh, that rules it out I'm afraid.

> Using zopfli is slow, but it also uses the same format as gzip, so users won't need additional software to pour bottles.

Agreed. This is a nice idea, thanks @johnsonjh. Marking as help wanted.

MikeMcQuaid, Aug 01 '22 13:08

I've taken a look at building bottles with zopfli. For reference, the section that I'm looking at replacing is: https://github.com/Homebrew/brew/blob/8dc46a7c477929185cba8ca0de5f9c843b3e9385/Library/Homebrew/dev-cmd/bottle.rb#L428-L436

We create the tarball in two steps to support build reproducibility. Apart from simply passing the archive into our GzipWriter, there are two other things that are set here:

  • We set the Gzip timestamp to the modified time of the tab (rather than to the time the TAR archive was created) so that the time of build does not affect the resulting checksum
  • We set the original file name field in the Gzip header to indicate the name of the original TAR archive contained within
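The two header fields above can be illustrated with Ruby's built-in `Zlib::GzipWriter`. This is a minimal sketch and not the actual code in bottle.rb; `reproducible_gzip`, the payload, and the timestamp are hypothetical.

```ruby
require "stringio"
require "zlib"

# Gzip `tar_bytes` with a caller-chosen header timestamp and original name,
# so the output depends only on the inputs and not on when it was built.
def reproducible_gzip(tar_bytes, mtime:, orig_name:)
  out = StringIO.new(String.new)
  gz = Zlib::GzipWriter.new(out)
  # Both header fields must be set before the first write.
  gz.mtime = mtime.to_i
  gz.orig_name = orig_name
  gz.write(tar_bytes)
  gz.finish
  out.string
end

tab_mtime = Time.utc(2022, 8, 26) # hypothetical stand-in for the tab's modified time
first  = reproducible_gzip("fake tar payload", mtime: tab_mtime, orig_name: "example.tar")
second = reproducible_gzip("fake tar payload", mtime: tab_mtime, orig_name: "example.tar")
puts first == second
```

Because neither header field depends on the wall clock, compressing the same archive twice gives identical bytes and therefore the same checksum.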

Notably, zopfli handles these two fields differently (and I've yet to find a way to configure/change the behavior):

  • It always sets the timestamp to 00 00 00 00 (midnight UTC, 1 Jan 1970)
  • It does not store the original file name of anything it compresses

See relevant section in https://github.com/google/zopfli/blob/master/src/zopfli/gzip_container.c#L93-L98 on how zopfli sets these flags in the Gzip header.
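To compare what different compressors put in these fields, the header can be read directly. A minimal Ruby sketch, assuming the layout from RFC 1952 (10 fixed bytes, then optional FEXTRA and FNAME fields); `gzip_header_fields` is a hypothetical helper name.

```ruby
require "stringio"
require "zlib"

# Read the timestamp and the optional original file name from a gzip header.
def gzip_header_fields(bytes)
  data = bytes.b # work on raw bytes regardless of the string's encoding
  id1, id2, cm, flg, mtime, _xfl, _os = data.unpack("CCCCVCC")
  raise ArgumentError, "not a gzip stream" unless id1 == 0x1f && id2 == 0x8b && cm == 8

  offset = 10
  if (flg & 0x04) != 0 # FEXTRA: 2-byte little-endian length, then that many bytes
    offset += 2 + data.byteslice(offset, 2).unpack1("v")
  end
  name = nil
  if (flg & 0x08) != 0 # FNAME: NUL-terminated original file name
    name = data.byteslice(offset, data.bytesize - offset).unpack1("Z*")
  end
  { mtime: mtime, orig_name: name }
end

# Demo: a stream written by GzipWriter with both fields set.
io = StringIO.new(String.new)
gz = Zlib::GzipWriter.new(io)
gz.mtime = 1_661_472_000
gz.orig_name = "example.tar"
gz.write("payload")
gz.finish
p gzip_header_fields(io.string)
```

Running the same helper on a zopfli-produced file should show the zeroed timestamp and missing name described above.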

Are these changes in how the fields are handled acceptable for switching to zopfli for bottling? I think build reproducibility would be preserved, although the values of these fields differ from the current implementation. I don't foresee the loss of the original file name in the bottle causing problems for most people; gzip will strip the .gz extension to get the output file name unless -N/--name is passed.

In short, I don't foresee these differences breaking how bottles work for anyone, but there could always be some user(s) doing weird things with our bottles.


On another note, I do think a more comprehensive CI time vs bottle size investigation is warranted. I tried a makeshift test with the Monterey bottle for openjdk@11 (since it's sitting in my cache folder):

| Compression method | Compression time | Resulting size |
| --- | --- | --- |
| gzip (default compression level - probably 6?) | 13s | 191.4 MB (60.8% of original) |
| zopfli (default compression level - 15 iterations) | 19m45s (91.2x as long as gzip) | 188.4 MB (59.8% of original) |

It may also be worth trying with a small bottle; an extra 20 mins of build time on an already long build for something like openjdk@11 might not be as big of a deal as an extra 15-30 seconds on each small bottle job. That's on the (possibly wrong) generalization that big bottles tend to result from long builds.
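The tradeoff in the table can be put in numbers. A small Ruby sketch using only the figures reported above; `compression_tradeoff` is a hypothetical helper.

```ruby
# Compare a candidate compressor against a baseline: how much slower it is,
# how much space it saves, and how many extra seconds each saved MB costs.
def compression_tradeoff(base_secs:, new_secs:, base_mb:, new_mb:)
  saved_mb = base_mb - new_mb
  {
    slowdown: new_secs / base_secs.to_f,
    saved_mb: saved_mb,
    saved_pct: 100.0 * saved_mb / base_mb,
    extra_secs_per_mb: (new_secs - base_secs) / saved_mb.to_f,
  }
end

# gzip: 13s, 191.4 MB; zopfli: 19m45s, 188.4 MB (openjdk@11 Monterey bottle)
result = compression_tradeoff(base_secs: 13, new_secs: 19 * 60 + 45,
                              base_mb: 191.4, new_mb: 188.4)
result.each { |key, value| puts format("%-18s %.2f", key, value) }
```

For this bottle that works out to roughly 91x the compression time for about 3 MB (about 1.6%) saved.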

alebcay, Aug 26 '22 04:08

> I think build reproducibility would be preserved, although the values of these fields differ from the current implementation.

We'll need a way to migrate bottles one-by-one, in that case, with some sort of DSL or list. We don't want to change the underlying gzip mechanism and invalidate all bottle reproducibility work.

> On another note, I do think a more comprehensive CI time vs bottle size investigation is warranted. I tried a makeshift test with the Monterey bottle for openjdk@11 (since it's sitting in my cache folder):

Ouch. This slowdown/size comparison makes it not worth it for us at all, unfortunately.

MikeMcQuaid, Aug 26 '22 07:08

> We'll need a way to migrate bottles one-by-one, in that case, with some sort of DSL or list. We don't want to change the underlying gzip mechanism and invalidate all bottle reproducibility work.

To clarify what I meant - bottles produced with zopfli should continue to be reproducible, in that building a bottle now vs in a few hours should produce the same result (since timestamp and filename info is completely scrapped).

As you suggest, bottles produced with zopfli would not be reproducible or have the same checksum when compared to an existing bottle of the same contents that has been produced by GzipWriter. However, I'm not sure I'm seeing how this feature/characteristic is needed - could we just leave existing bottles as-is and start producing new bottles with zopfli? Some care may be needed on handling :all bottles.


> Ouch. This slowdown/size comparison makes it not worth it for us at all, unfortunately.

On the opposite end of the tradeoff, would there be interest in using pigz instead to see if we can reduce compression time? Resulting tarballs can still just be unpacked as usual by ordinary gzip. Trying again with the Monterey openjdk@11 bottle on a quad-core machine:

| Compression method | Compression time | Resulting size |
| --- | --- | --- |
| gzip (default compression level - probably 6?) | 11.97s | 191.4 MB |
| pigz (default compression level - probably 6?) | 3.928s (0.328x as long as gzip) | 191.4 MB |
| pigz -11 | 15m9s (75.94x as long as gzip) | 187.4 MB |
| pigz -9 | 6.363s (0.532x as long as gzip) | 191.1 MB |

Knowing the difference between gzip and pigz, I guess we would only see a speedup if CI machines have more cores. Looking at the resulting Gzip file headers, pigz does set the file name and timestamp taken from the source file (the TAR archive in our case). I don't see a way to override them with custom values, but pigz can be told not to store the info at all (kind of like what zopfli does).
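Calling an external gzip-compatible compressor from Ruby could look roughly like the sketch below. It is an illustration only: `compress_with_external` is a hypothetical helper, and the example command line `pigz -n -c` (no name/time stored, write to stdout) is an assumption that should be checked against the installed tool's help output.

```ruby
require "open3"
require "stringio"
require "zlib"

# Pipe `data` through an external gzip-compatible compressor and return the
# compressed bytes. `argv` is the full command line, e.g. %w[pigz -n -c].
def compress_with_external(data, argv)
  out, status = Open3.capture2(*argv, stdin_data: data, binmode: true)
  raise "#{argv.first} exited with status #{status.exitstatus}" unless status.success?

  # Safety check: an ordinary gzip reader must be able to restore the input.
  restored = Zlib::GzipReader.new(StringIO.new(out)).read
  raise "#{argv.first} output does not round-trip" unless restored.b == data.b

  out
end

# Usage (requires the tool to be installed):
#   tar_bytes = File.binread("example.tar")
#   gz_bytes  = compress_with_external(tar_bytes, %w[pigz -n -c])
```

The round-trip check reflects the point made above: whatever tool is used, the result still has to unpack with ordinary gzip.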

alebcay, Aug 26 '22 16:08

> To clarify what I meant - bottles produced with zopfli should continue to be reproducible, in that building a bottle now vs in a few hours should produce the same result (since timestamp and filename info is completely scrapped).

Yup. Without this it'd be entirely unusable for us.

> However, I'm not sure I'm seeing how this feature/characteristic is needed - could we just leave existing bottles as-is and start producing new bottles with zopfli?

We could but we will need to track this so that e.g. users can still run brew bottle and reproducibly create existing bottles that haven't been updated yet.

> On the opposite end of the tradeoff, would there be interest in using pigz instead to see if we can reduce compression time?

If the resulting size is the same, it's faster and still reproducible: yes, this sounds good.


Honestly, given the caveats though, I'm not sure this effort is worthwhile. It seems like a lot of work for (overall) minimal time savings, more dependencies and reproducibility impacts.

MikeMcQuaid, Aug 26 '22 17:08

Using pigz -11 enables the use of zopfli when creating .gz files. On my desktop system here (24 cores), the speed of pigz -11 isn't terrible compared to non-parallel gzip, and on server systems using pigz -11 sometimes results in faster compression than plain gzip.

Also, I find using pigz -9 isn't slower than pigz -6 because I/O becomes the bottleneck.

johnsonjh, Aug 26 '22 20:08

I've added results for pigz -9 and pigz -11 in my table above. Again, I'm using a quad-core system here so that affects these numbers, but I think our CI systems are hexa-core or octa-core? Naturally, with more cores pigz will go faster (until the operation is I/O-bound like you describe).

In any case, with Zopfli we seem to be saving a few MB (< 5%) at best. We could probably reduce compression time with pigz -6 but the time spent packaging the .tar into .tar.gz isn't really a big portion of the time spent on building bottles, so even substantial improvements in this stage yield a very minor improvement on the overall process.

With changing to either Zopfli or pigz, the reproducibility concern that Mike raised earlier still stands.


On the original topic of this issue of just extending support for an external compressor to be called, I would be fine with the idea of making an interface available for users who host taps or self-host bottles to use a Gzip-compatible compressor of their own choosing (although this project would continue to just use the existing GzipWriter implementation that is built into Ruby).

On the other hand, this does add complexity for something that we don't use and might not test as much. An additional stage of patching the timestamp and filename in the header of the Gzip file after creation might also be necessary (for cross-machine reproducibility purposes, e.g. building :all bottles) - we don't know anything about the user's choice of custom compressor and how it impacts the timestamp/filename stored in the Gzip header. I'm not sure this complexity is something that is worth including given the modest size savings or time savings.

On a tangent, I can't even find a tool to modify Gzip headers easily after the file has been created. I'm surprised such a tool doesn't exist yet.
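Patching the header after the fact is a small amount of code, since the deflate data and the trailer do not depend on the header. A minimal Ruby sketch, assuming the RFC 1952 layout; `rewrite_gzip_header` is a hypothetical helper, and it drops any extra field, comment, or header CRC from the original header.

```ruby
require "stringio"
require "zlib"

# Return a copy of `bytes` whose gzip header carries the given mtime and name.
# The deflate stream and the CRC32/ISIZE trailer are copied unchanged, since
# they are computed from the uncompressed data only.
def rewrite_gzip_header(bytes, mtime:, orig_name: nil)
  data = bytes.b
  id1, id2, cm, flg, _old_mtime, xfl, os = data.unpack("CCCCVCC")
  raise ArgumentError, "not a gzip stream" unless id1 == 0x1f && id2 == 0x8b && cm == 8

  # Find where the old header ends by skipping its optional fields.
  offset = 10
  offset += 2 + data.byteslice(offset, 2).unpack1("v") if (flg & 0x04) != 0 # FEXTRA
  offset = data.index("\0".b, offset) + 1 if (flg & 0x08) != 0              # FNAME
  offset = data.index("\0".b, offset) + 1 if (flg & 0x10) != 0              # FCOMMENT
  offset += 2 if (flg & 0x02) != 0                                          # FHCRC

  # Build the new header: only the FNAME flag is set, and only if a name is given.
  new_flg = orig_name ? 0x08 : 0x00
  header = [0x1f, 0x8b, 8, new_flg, mtime.to_i, xfl, os].pack("CCCCVCC")
  header << orig_name.b << "\0".b if orig_name
  header + data.byteslice(offset, data.bytesize - offset)
end

# Demo: build a small gzip, then swap its header fields.
io = StringIO.new(String.new)
gz = Zlib::GzipWriter.new(io)
gz.mtime = 1_661_472_000
gz.orig_name = "built-on-machine-a.tar"
gz.write("payload")
gz.finish

patched = rewrite_gzip_header(io.string, mtime: 1_234_567_890, orig_name: "example.tar")
reader = Zlib::GzipReader.new(StringIO.new(patched))
puts reader.mtime.to_i, reader.orig_name.inspect, reader.read
```

A step like this could normalize the output of an arbitrary user-chosen compressor, at the cost of the extra complexity discussed above.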

alebcay, Aug 26 '22 22:08

> On the other hand, this does add complexity for something that we don't use and might not test as much.

Yes, this would be my primary objection.

It's a feature that there's only one way of producing a bottle and that we handle reproducibility concerns for them.

MikeMcQuaid, Aug 30 '22 08:08

> Honestly, given the caveats though, I'm not sure this effort is worthwhile. It seems like a lot of work for (overall) minimal time savings, more dependencies and reproducibility impacts.

I haven't looked into reproducibility, but this is part of why I wouldn't so easily dismiss Zstd: it offers a good compromise between speed and compression ratio. This is also why it's starting to be used pretty widely (example).

As a benchmark, for openjdk:

| Compression method | Compression time | Resulting size |
| --- | --- | --- |
| existing bottle | 7.36s | 189 MB |
| zstd (default compression level - 3) | 0.69s | 179.48 MB |
| zstd -6 | 1.99s | 175.15 MB |
| zstd -12 | 6.90s | 170.73 MB |
| zstd -19 (max, without --ultra) | 1m9.76s | 165.33 MB |

With the right flags, we can probably get modest savings in speed and significant savings in size.

~~Actually, I just realised my times above are slightly inflated, because I included the time to do gunzip in those times.~~

carlocab, Aug 30 '22 11:08

> I haven't looked into reproducibility, but this is part of why I wouldn't so easily dismiss Zstd

The criteria:

  • needs to be able to either bit-for-bit reproduce existing bottles OR we add a DSL to preserve existing reproducibility
  • needs to be able to be used for the entire dependency tree, including itself
  • needs to be faster (or same) to compress AND decompress AND have improved (or same) size
  • the code required to implement the above needs to be sufficiently simple to warrant this

To be completely honest, though, this doesn't feel like anywhere near the top 10 (maybe top 100) problems that maintainers or users have with Homebrew today.

MikeMcQuaid, Aug 30 '22 11:08

Oh, and I forgot to mention: macOS tar can untar *.tar.zst archives, so you don't need an extra dependency to pour the bottles on macOS.

carlocab, Aug 30 '22 11:08

I would really like to see Homebrew use zstd, or a more efficient format, in the future.

FYI Arch Linux made the switch to zstd in early 2020 and it's well received. Same goes for Fedora. Snap added support for LZO in late 2020 (they didn't choose zstd because of other constraints).

What I'm saying is that some big players are all favoring more efficient formats. They consider it valuable and Homebrew, being a package manager after all, should not be an exception.

kidonng, Sep 18 '22 05:09

> They consider it valuable and Homebrew, being a package manager after all, should not be an exception.

Many other package managers have not and will not make the switch.

I've laid out the requirements above. If someone has an implementation that meets these requirements: great. Until then: agreement here doesn't make much difference either way.

MikeMcQuaid, Sep 18 '22 12:09

Passing on this, sorry, it's too involved for us to really consider.

MikeMcQuaid, Feb 16 '23 14:02