compress/gzip: use bufio to improve compress speed
What version of Go are you using (go version)?
$ go version go1.17.6
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
$ go env GOARCH="amd64" GOHOSTARCH="amd64" GOHOSTOS="darwin" GOVERSION="go1.17.6"
What did you do?
Use gzip to compress, for example:
// example for tar.gz compress
func TarGz(...) error {
...
// writer
fw, err := os.Create(dst)
if err != nil {
return err
}
defer fw.Close()
gw := gzip.NewWriter(fw)
defer gw.Close()
tw := tar.NewWriter(gw)
defer tw.Close()
// compress and write file
fr, err := os.Open(src)
if err != nil {
return err
}
defer fr.Close()
_, err = io.Copy(tw, fr)
if err != nil {
return err
}
...
}
In my case, the target tar file dst is on a device with high I/O latency.
What did you expect to see?
The time cost of TarGz() to be close to that of the Linux command tar -czf .....
What did you see instead?
The result shows that TarGz() is 8x slower than tar -czf ....
Tracing the code, I found that gzip writes to the file roughly every 240 bytes, which is inefficient in my case.
Thus, I recommend using a bufio writer as a buffered writer inside gzip to avoid issuing too many I/O requests during compression. This is just what is already done in compress/lzw:
// compress/lzw/writer.go
...
w.w = bufio.NewWriter(dst)
...
w.w.WriteByte(...)
...
w.w.Flush()
...
With bufio, the dst file is written every 4096 bytes (compared to 240 bytes), which reduces the number of I/O requests and gives a significant improvement in my case.
Furthermore, it makes the size of I/O requests stable and independent of the implementation of the given compression algorithm.
It's straightforward for your code to wrap fw using bufio.NewWriter to get the speedup you want. As some people use gzip to write data to a network socket, it's not clear that it would be a good idea for gzip to use bufio by default.
I'm not sure I understand why it's flushing every 240 bytes. The compressor already buffers a relatively large window before flushing: there's a built-in buffer of at least 32KiB of uncompressed input. Of course, that's for the uncompressed input, not the compressed output. Is the input highly compressible, such that a large chunk of input produces only a small amount of output?
Assuming this issue is because the data is highly compressible, I don't think we should add an output buffer automatically since memory use of DEFLATE compressors is something that matters for servers with many concurrent writers.
FYI, I think 240 bytes is set here:
https://github.com/golang/go/blob/537c4354cb9fdf8812c0448bd8f8a3b9f9ab1736/src/compress/flate/huffman_bit_writer.go#L25-L29
Thank you for the reply! I'm confused that in compress/lzw/writer.go the buffer is set to 4096 bytes by bufio.NewWriter, but in compress/gzip/gzip.go it is effectively 240 bytes because of the implementation of huffman_bit_writer.
It seems that the buffer size is a choice that depends on the situation. For example, writing every 240 bytes to a file is not a good idea.
Maybe there is some way to expose this configuration. Or we have to recommend always using bufio in I/O-sensitive cases, whether a large or a small buffer is wanted, because different compression algorithms implement buffering in different ways.
I'm not sure I understand why it's flushing every 240 bytes.
The buffer size is of course rather arbitrary, but I think, as with most other packages, it is reasonable for users to apply the buffering that matches their use case. If you are writing to a bytes.Buffer or already have a bufio.Writer, there is little benefit to adding more internal buffering: you would just allocate more memory and get more cache misses.
Deflate is already rather memory-intensive. I think leaving the choice of adding a buffer to the user is the correct approach. The internal buffer is big enough that writes to an in-memory output aren't affected.