buildkit icon indicating copy to clipboard operation
buildkit copied to clipboard

Migrate off of gogo/protobuf

Open kzys opened this issue 2 years ago • 17 comments

gogo/protobuf has been deprecated since 2022. This PR uses Google's official code generators instead of gogo's.

Fixes #3119

kzys avatar Nov 16 '23 16:11 kzys

The fsutil change is in https://github.com/tonistiigi/fsutil/pull/171.

kzys avatar Nov 16 '23 17:11 kzys

Is there a guarantee that this does not break old-new client-daemon combinations?

tonistiigi avatar Nov 16 '23 18:11 tonistiigi

Yes. gogo/protobuf didn't modify the wire protocol. Their changes are primarily around code generators to make it more Go-idiomatic and efficient. I did a similar refactorting for containerd (see https://github.com/containerd/containerd/issues/6564), and it has been released as containerd 1.7.0. We haven't heard any complaints regarding the compatibility.

kzys avatar Nov 16 '23 19:11 kzys

Seeing some test failures; are those expected?

=== FAIL: solver/llbsolver TestRecomputeDigests (0.00s)
    vertex_test.go:54: 
        	Error Trace:	D:/a/buildkit/buildkit/solver/llbsolver/vertex_test.go:54
        	Error:      	Not equal: 
        	            	expected: digest.Digest("sha256:06c243a9083f537e33d82b0af0379737187f7fbba73f852fe968b592c36bd8b5")
        	            	actual  : string("sha256:06c243a9083f537e33d82b0af0379737187f7fbba73f852fe968b592c36bd8b5")

thaJeztah avatar Nov 16 '23 23:11 thaJeztah

Seeing some test failures; are those expected?

No. It should be fixed in the last revision, but the PR is still work-in-progress...

kzys avatar Nov 17 '23 04:11 kzys

By fixing all golangci-lint errors, I ended up getting failed to load cache key: docker.io/library/busybox:latest@sha256:b28d6f3d483bfa5b30562287e1197fca960ca5dd8211d8cbf1536eb937e21af8: not found which I couldn't figure out why. So I'm re-doing this PR again with much smaller commits.

kzys avatar Nov 21 '23 03:11 kzys

The tests are mostly passing! There are lint errors regarding copying proto-generated structs though.

@tonistiigi Could you take a look https://github.com/tonistiigi/fsutil/pull/171?

kzys avatar Nov 21 '23 21:11 kzys

I've fixed all lint issues! I do need to squash the last commits, though.

google.golang.org/protobuf made all proto structs with a mutex to prevent copies, which was a huge headache.

There are few places I have used proto.Clone(). We can write our clone functions (like I did in https://github.com/tonistiigi/fsutil/pull/171). Another option is using https://github.com/planetscale/vtprotobuf which provides not just reflection-less Clone, but also reflection-less Marshal/Unmarshal.

kzys avatar Nov 24 '23 01:11 kzys

looks like its happy now, except for DCO check (might want to check if you missed a DCO sign off on one of the commits)

thaJeztah avatar Nov 24 '23 19:11 thaJeztah

looks like its happy now, except for DCO check (might want to check if you missed a DCO sign off on one of the commits)

Yes. I squashed these un-signed small commits and signed them.

kzys avatar Nov 28 '23 16:11 kzys

There are few places I have used proto.Clone(). We can write our clone functions (like I did in https://github.com/tonistiigi/fsutil/pull/171). Another option is using https://github.com/planetscale/vtprotobuf which provides not just reflection-less Clone, but also reflection-less Marshal/Unmarshal.

Is it possible to get some benchmarks to understand how big issue this is and if we can stick with reflection or not.

tonistiigi avatar Nov 29 '23 01:11 tonistiigi

Are there any updates on the progress here? The only reason I care is that https://github.com/moby/buildkit/pull/4455 is accidentally dependent on it (see the PR description for why).

I totally get this PR may take some more time+effort to finish up, but if it's going to take a significantly longer time then it may be worth considering whether we should revert the changes in fsutil and have this PR depend on a fork of fsutil until it's ready to merge. Otherwise we'll be stuck in this state of not being able to update fsutil for a while.

cc @kzys (hi! long time no see 😄) @tonistiigi

sipsma avatar Jan 02 '24 16:01 sipsma

Hi again @sipsma!

Sorry, I was blocking the review. Let me update the PR shortly.

Is it possible to get some benchmarks to understand how big issue this is and if we can stick with reflection or not.

@tonistiigi - Do you have benchmarks for Buildkit itself?

kzys avatar Jan 20 '24 05:01 kzys

Do you have benchmarks for Buildkit itself?

I think for this we would do micro-benchmarks for marshaling, unmarshal. If we start running containers then this would skew the timing too much. In fsutil there was also a benchmark script that maybe can be used. Maybe LLB marshal and load is a nice candidate.

The biggest possible performance issue would be with the session endpoint and the context and build result transfer. This is where 100s of MB can potentially be transferred and encoded/decoded. If session uses gRPC dialer then it also uses protobuf for raw data framing. I believe previously some of these decoding could avoid allocating more memory because the Unmarshal code was reusing the buffer by resetting it with [:0]. For these big data transfers, we should avoid new memory allocations that grow with amount of transferred data.

tonistiigi avatar Apr 02 '24 17:04 tonistiigi

Just wondering, coz of the movements in the codebase, we are bound to have a lot of merge conflict as we go. Is possible to split this into some self-contained chunks like perhaps sub-packages and chain the PRs?

profnandaa avatar Apr 09 '24 06:04 profnandaa

I did a small subset of this change locally and added some benchmarks to some of the solver ops specifically and it seems like the google protobuf will definitely cause a regression.

benchmark                        old ns/op     new ns/op     delta
BenchmarkExecOp_Marshal-10       148           475           +220.46%
BenchmarkSourceOp_Marshal-10     234           608           +160.39%
BenchmarkFileOp_Marshal-10       143           475           +231.63%
BenchmarkMergeOp_Marshal-10      36.0          152           +320.64%
BenchmarkDiffOp_Marshal-10       31.2          152           +385.12%

It seems like there now exists vtprotobuf which I'll do a quick test of to see if that is a feasible alternative.

jsternberg avatar Aug 15 '24 18:08 jsternberg

Some additional numbers comparing gogo protobuf with vtprotobuf:

benchmark                        old ns/op     new ns/op     delta
BenchmarkExecOp_Marshal-10       148           118           -20.59%
BenchmarkSourceOp_Marshal-10     234           128           -45.01%
BenchmarkFileOp_Marshal-10       143           100.0         -30.17%
BenchmarkMergeOp_Marshal-10      36.0          35.2          -2.25%
BenchmarkDiffOp_Marshal-10       31.2          27.2          -13.02%

Seems pretty easy to integrate so we might just be able to do that.

jsternberg avatar Aug 15 '24 19:08 jsternberg