Migrate off of gogo/protobuf
gogo/protobuf has been deprecated since 2022. This PR uses Google's official code generators instead of gogo's.
Fixes #3119
The fsutil change is in https://github.com/tonistiigi/fsutil/pull/171.
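For orientation, here is a rough sketch of what the generator switch looks like; buildkit's real protoc invocations live in its generation scripts, so the plugin flags below are illustrative only:

```go
// Illustrative only: gogo's protoc plugin is replaced by Google's official
// protoc-gen-go and protoc-gen-go-grpc plugins.
package pb

// Before (gogo):
//   protoc --gogofaster_out=plugins=grpc:. ops.proto
//
// After (official generators):
//go:generate protoc --go_out=. --go-grpc_out=. ops.proto
```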
Is there a guarantee that this does not break old-new client-daemon combinations?
Yes. gogo/protobuf didn't modify the wire protocol; its changes were primarily in the code generators, to make the generated code more Go-idiomatic and efficient. I did a similar refactoring for containerd (see https://github.com/containerd/containerd/issues/6564), and it has been released as containerd 1.7.0. We haven't heard any complaints regarding compatibility.
Seeing some test failures; are those expected?
```
=== FAIL: solver/llbsolver TestRecomputeDigests (0.00s)
    vertex_test.go:54:
        Error Trace:    D:/a/buildkit/buildkit/solver/llbsolver/vertex_test.go:54
        Error:          Not equal:
                        expected: digest.Digest("sha256:06c243a9083f537e33d82b0af0379737187f7fbba73f852fe968b592c36bd8b5")
                        actual  : string("sha256:06c243a9083f537e33d82b0af0379737187f7fbba73f852fe968b592c36bd8b5")
```
> Seeing some test failures; are those expected?
No. It should be fixed in the latest revision, but the PR is still a work in progress...
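For context on the failure above: testify's assert.Equal compares values with reflect.DeepEqual, so a digest.Digest and a plain string never compare equal even when the underlying bytes match, which is why the error reports a type difference rather than a value difference. A minimal standalone illustration (not code from the PR):

```go
package main

import (
	"fmt"
	"reflect"

	"github.com/opencontainers/go-digest"
)

func main() {
	d := digest.Digest("sha256:06c243a9083f537e33d82b0af0379737187f7fbba73f852fe968b592c36bd8b5")
	s := string(d)

	fmt.Println(d == digest.Digest(s))   // true: same underlying string value
	fmt.Println(reflect.DeepEqual(d, s)) // false: digest.Digest and string are different Go types
}
```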
While fixing all the golangci-lint errors, I started getting `failed to load cache key: docker.io/library/busybox:latest@sha256:b28d6f3d483bfa5b30562287e1197fca960ca5dd8211d8cbf1536eb937e21af8: not found`, and I couldn't figure out why. So I'm redoing this PR with much smaller commits.
The tests are mostly passing! There are lint errors regarding copying proto-generated structs though.
@tonistiigi Could you take a look https://github.com/tonistiigi/fsutil/pull/171?
I've fixed all lint issues! I do need to squash the last commits, though.
google.golang.org/protobuf embeds a do-not-copy mutex marker in every generated struct so that go vet flags copies, which was a huge headache.
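A minimal sketch of what that looks like in practice (the pb.Meta fields here are only an example): the generated structs embed a marker that go vet's copylocks check treats as a lock, so any value copy fails lint.

```go
package main

import "github.com/moby/buildkit/solver/pb"

func main() {
	src := &pb.Meta{Args: []string{"echo", "hi"}}

	// go vet's copylocks check flags value copies of the generated structs
	// ("assignment copies lock value ..."), so this line would fail lint:
	// dst := *src

	// Sharing the pointer (or making an explicit clone, see the next sketch) is fine.
	dst := src
	_ = dst
}
```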
There are a few places where I have used proto.Clone(). We could write our own clone functions (as I did in https://github.com/tonistiigi/fsutil/pull/171). Another option is https://github.com/planetscale/vtprotobuf, which provides not just a reflection-less Clone but also reflection-less Marshal/Unmarshal.
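A minimal sketch of the two options, assuming the post-migration types generated by protoc-gen-go (the pb.Platform field names are taken from buildkit's ops.proto but should be treated as illustrative):

```go
package main

import (
	"github.com/moby/buildkit/solver/pb"
	"google.golang.org/protobuf/proto"
)

// clonePlatform is a hand-written, reflection-free clone for one message type.
func clonePlatform(p *pb.Platform) *pb.Platform {
	if p == nil {
		return nil
	}
	return &pb.Platform{
		Architecture: p.Architecture,
		OS:           p.OS,
		Variant:      p.Variant,
		OSVersion:    p.OSVersion,
		OSFeatures:   append([]string(nil), p.OSFeatures...),
	}
}

// cloneGeneric is the generic, reflection-based alternative.
func cloneGeneric(p *pb.Platform) *pb.Platform {
	return proto.Clone(p).(*pb.Platform)
}

func main() {
	p := &pb.Platform{OS: "linux", Architecture: "amd64"}
	_ = clonePlatform(p)
	_ = cloneGeneric(p)
}
```

vtprotobuf generates the same kind of per-message, reflection-free code instead of requiring it to be written by hand.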
Looks like it's happy now, except for the DCO check (you might want to check whether you missed a DCO sign-off on one of the commits).
> Looks like it's happy now, except for the DCO check (you might want to check whether you missed a DCO sign-off on one of the commits).
Yes. I squashed those unsigned small commits and signed them off.
> There are a few places where I have used proto.Clone(). We could write our own clone functions (as I did in https://github.com/tonistiigi/fsutil/pull/171). Another option is https://github.com/planetscale/vtprotobuf, which provides not just a reflection-less Clone but also reflection-less Marshal/Unmarshal.
Is it possible to get some benchmarks, to understand how big an issue this is and whether we can stick with reflection or not?
Are there any updates on the progress here? The only reason I care is that https://github.com/moby/buildkit/pull/4455 is accidentally dependent on it (see the PR description for why).
I totally get that this PR may take more time and effort to finish up, but if it's going to take significantly longer, it may be worth considering whether we should revert the changes in fsutil and have this PR depend on a fork of fsutil until it's ready to merge. Otherwise we'll be stuck unable to update fsutil for a while.
cc @kzys (hi! long time no see 😄) @tonistiigi
Hi again @sipsma!
Sorry, I was blocking the review. Let me update the PR shortly.
> Is it possible to get some benchmarks, to understand how big an issue this is and whether we can stick with reflection or not?
@tonistiigi - Do you have benchmarks for BuildKit itself?
> Do you have benchmarks for BuildKit itself?
I think for this we would do micro-benchmarks for marshal/unmarshal. If we start running containers, that would skew the timing too much. In fsutil there was also a benchmark script that could maybe be reused. LLB marshal and load might be a nice candidate.
The biggest potential performance issue would be with the session endpoint and the context and build-result transfer. This is where hundreds of MB can potentially be transferred and encoded/decoded. If the session uses the gRPC dialer, it also uses protobuf for raw data framing. I believe previously some of that decoding could avoid allocating more memory because the Unmarshal code reused the buffer by resetting it with [:0]. For these big data transfers we should avoid new memory allocations that grow with the amount of transferred data.
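For reference, here is a sketch of the kind of marshal micro-benchmark being discussed, assuming the post-migration google.golang.org/protobuf-generated solver types (the op contents are made up, and buildkit's actual benchmarks may differ):

```go
package pb_test

import (
	"testing"

	"github.com/moby/buildkit/solver/pb"
	"google.golang.org/protobuf/proto"
)

// BenchmarkExecOpMarshal measures marshaling a small ExecOp, roughly the
// shape of the solver-op benchmarks quoted later in this thread.
func BenchmarkExecOpMarshal(b *testing.B) {
	op := &pb.Op{
		Op: &pb.Op_Exec{Exec: &pb.ExecOp{
			Meta: &pb.Meta{Args: []string{"/bin/sh", "-c", "echo hello"}},
		}},
	}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := proto.Marshal(op); err != nil {
			b.Fatal(err)
		}
	}
}
```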
Just wondering: because of all the movement in the codebase, we are bound to have a lot of merge conflicts as we go. Is it possible to split this into self-contained chunks, perhaps per sub-package, and chain the PRs?
I did a small subset of this change locally and added benchmarks for some of the solver ops specifically, and it seems like Google's protobuf will definitely cause a regression.
```
benchmark                      old ns/op     new ns/op     delta
BenchmarkExecOp_Marshal-10     148           475           +220.46%
BenchmarkSourceOp_Marshal-10   234           608           +160.39%
BenchmarkFileOp_Marshal-10     143           475           +231.63%
BenchmarkMergeOp_Marshal-10    36.0          152           +320.64%
BenchmarkDiffOp_Marshal-10     31.2          152           +385.12%
```
It seems there now exists vtprotobuf, which I'll do a quick test of to see whether it is a feasible alternative.
Some additional numbers comparing gogo protobuf with vtprotobuf:
```
benchmark                      old ns/op     new ns/op     delta
BenchmarkExecOp_Marshal-10     148           118           -20.59%
BenchmarkSourceOp_Marshal-10   234           128           -45.01%
BenchmarkFileOp_Marshal-10     143           100.0         -30.17%
BenchmarkMergeOp_Marshal-10    36.0          35.2          -2.25%
BenchmarkDiffOp_Marshal-10     31.2          27.2          -13.02%
```
Seems pretty easy to integrate, so we might just be able to do that.
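For the record, here is a sketch of what call sites could look like if the messages were regenerated with protoc-gen-go-vtproto; with its marshal, unmarshal, size, and clone features enabled, the generated MarshalVT/UnmarshalVT/SizeVT/CloneVT methods sit alongside the regular API, and the reflection-based proto.Marshal/proto.Clone keep working as a fallback. This assumes the buildkit protos have been regenerated that way:

```go
package main

import (
	"log"

	"github.com/moby/buildkit/solver/pb"
)

func main() {
	op := &pb.Op{
		Op: &pb.Op_Exec{Exec: &pb.ExecOp{
			Meta: &pb.Meta{Args: []string{"/bin/sh", "-c", "true"}},
		}},
	}

	// Reflection-free marshal/unmarshal generated by protoc-gen-go-vtproto.
	data, err := op.MarshalVT()
	if err != nil {
		log.Fatal(err)
	}
	decoded := &pb.Op{}
	if err := decoded.UnmarshalVT(data); err != nil {
		log.Fatal(err)
	}

	// Reflection-free deep copy, replacing proto.Clone on hot paths.
	_ = op.CloneVT()
}
```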