remote-apis
V3 idea: Let Digest.hash use 'bytes' instead of 'string'
Right now the Digest message uses a string to store the checksum of the object. This is wasteful: for the hashing algorithms we use, the output is always a sequence of bytes, and hex-encoding it doubles its size. Using a bytes field cuts the storage space in half (for that field; not necessarily for the full object).
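A minimal sketch of the size difference, using Python's stdlib `hashlib` (the Digest message itself is protobuf; this just illustrates the encoding overhead):

```python
import hashlib

blob = b"some cached action output"
digest = hashlib.sha256(blob)

raw = digest.digest()       # 32 bytes: what a 'bytes' field would carry
hexed = digest.hexdigest()  # 64-character string: what the current 'string' field carries

assert len(raw) == 32
assert len(hexed) == 64               # exactly 2x the raw size
assert bytes.fromhex(hexed) == raw    # lossless round trip between the two forms
```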
Hi Ed! Thanks for filing this.
We've discussed this a couple times within Google in 2017/2018, so copying out the conclusions we came to from one of those evaluations as a starting point. I think we may also have discussed it in the community in the old BuildFarm group as well, but I couldn't find a handy reference for that discussion. Regardless, what we came up with then was:
Arguments in favor of representing hashes as bytes when sent over the network:
- Reduces the bytes sent over the network: even when using gzip, a hexadecimal representation increases network bandwidth requirements by around 30% (after eliminating outputs) for both cache hits and misses.
  - For a 4000-action build, all cache hits: 4000 * 3 (digests) * 32B * 30% ~= 115 kB
  - For a 4000-action build, no cache hits: 4000 * ~100 digests * 32B * 30% ~= 3.8 MB
  - Average is >90% cache hits -> ~487 kB
  - Ballpark $60/year in bandwidth for TensorFlow (assuming GCP prices, 10k builds/week at ~10000 actions) [date of numbers unknown; probably early 2018]
- There's no need to convert between bytes <-> hex, except during debugging.
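The bandwidth figures above can be recomputed in a few lines, under the stated assumptions (32-byte SHA-256 digests, ~30% on-the-wire overhead for gzipped hex):

```python
# Back-of-the-envelope check of the estimates above; the constants are the
# assumptions stated in the bullet points, not measured values.
DIGEST_BYTES = 32
HEX_OVERHEAD = 0.30  # extra bytes on the wire for hex vs. raw, after gzip

def extra_bytes(actions: int, digests_per_action: int) -> float:
    """Extra network bytes attributable to hex encoding."""
    return actions * digests_per_action * DIGEST_BYTES * HEX_OVERHEAD

all_hits = extra_bytes(4000, 3)      # ~115 kB per build
all_misses = extra_bytes(4000, 100)  # ~3.8 MB per build
# Blending at a >90% hit rate lands near the ~487 kB figure quoted above:
blended = 0.9 * all_hits + 0.1 * all_misses
```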
Arguments against:
- Hex must still be specified as a format for blobs in the API.
  - (Unless ByteStream is eliminated) for specifying digests in URLs
- All clients and servers should log values consistently, to allow easy out-of-band communication during debugging.
- Spreads logic for dealing with the digest format throughout the code, vs. keeping it encapsulated.
- Current debugging tools do not print bytes in an easily readable format.
- It's an API change vs today, and so default is "no change" unless proven to be sufficiently worthwhile.
At the time, the deciding factor seemed to come down to whether the difference in bandwidth was worth the effort of changing and the increase in complexity in dealing with two formats. (Noting that for backend storage you can use whatever you want, so it's only the API that strictly must be defined one way or another.) This could be argued on the grounds either that the cost of the bandwidth is material, or that builds are measurably slowed (wallclock) because of the increased time required to move the bytes around.
Ultimately, we didn't find this to be a particular bottleneck, and so the value of changing it seemed insufficiently high to justify a change. Of course, time has passed, Bazel Without the Bytes is now live and used in more contexts (and other clients have similar download-avoiding optimizations), etc, so - if you (or anyone else) wants to do a fresh analysis to argue that it's now sufficiently material to warrant the change, please feel free! But understand that a data-driven argument is going to be required here, as the effort in revisiting this (and implementing the change if that's what we come to) is significant.
Cheers, Eric
The Microsoft AnyBuild team is in favor of this change. Our caching implementation internally uses only bytes, so strings are an impedance mismatch everywhere except text logging. Additionally, for equality comparisons, a case-insensitive string comparison is orders of magnitude slower (more CPU-expensive) than an optimized byte comparison implementation.
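A small sketch of the comparison mismatch described above (illustrative only; the values are made up): hex digests may arrive in either case, so a correct string comparison must normalize first, while raw bytes have one canonical form.

```python
a_hex = "AB12CD"  # same digest, emitted by an uppercase-hex producer
b_hex = "ab12cd"  # ...and by a lowercase-hex producer

# String path: a naive equality check is wrong for mixed-case producers...
assert a_hex != b_hex
# ...so a per-character case-insensitive compare is needed:
assert a_hex.lower() == b_hex.lower()

# Bytes path: decoding yields one canonical form, and a plain
# memcmp-style comparison suffices.
assert bytes.fromhex(a_hex) == bytes.fromhex(b_hex)
```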
I'm also in favor of this change. We're using Digest as a key in a few in-memory data structures, and the 2x size overhead has shown up as a significant contributor to memory consumption for us. We're also using Digest in some cases for machine-to-machine communication, so we currently have to convert back and forth between two separate classes for digests. On top of that, hex digests also require an additional consistency check: are all characters valid hex digits?
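To illustrate that extra consistency check, here is a hypothetical validator (the function name and the 32-byte default are assumptions, not part of the API): a bytes field needs only a length check, whereas a hex string also needs a character-set check.

```python
def validate_hex_digest(s: str, hash_len: int = 32) -> bytes:
    """Validate a hex-encoded digest and return its raw bytes.

    hash_len is the raw digest size (32 for SHA-256).
    """
    if len(s) != 2 * hash_len:
        raise ValueError("wrong digest length")
    try:
        return bytes.fromhex(s)  # rejects any non-hex character
    except ValueError:
        raise ValueError("digest contains non-hex characters")
```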
I think the memory consumption / conversion cost may also apply to Bazel - IIRC, it stores digests as binary to avoid the 2x memory blowup, but then it has to convert every time it talks to the network. The alternative would seem to be preferable. Though note that this isn't the biggest problem in Bazel right now - there are still a bunch of things that can be improved before this would even be on the radar.