lorawan-stack
lorawan-stack copied to clipboard
Migrate away from gogo/protobuf
Summary
https://github.com/gogo/protobuf is not mantained any more https://github.com/gogo/protobuf/issues/691 (currently)
Why do we need this?
It's our dependency, which is incompatible with new golang/protobuf version, which more and more packages depend on, hence we need to replace the golang/protobuf version, depending on outdated versions of our direct dependencies and potentially even breaking packages this way
What is already there? What do you see now?
gogo/protobuf dependency
What is missing? What do you want to see?
Figure this out
How do you propose to implement this?
Figure out if a new maintainer will appear or different plugin with feature parity? Use just vanilla protobuf?
How do you propose to test this?
tests
Can you do this yourself and submit a Pull Request?
yes
Validation plugin we're using dropped support for GoGo https://github.com/envoyproxy/protoc-gen-validate/pull/340
I think the best way forward is to follow the ecosystem and migrate away from gogo/protobuf. With more and more of our other dependencies moving away from gogo, I think it will become increasingly difficult to keep using it. Of course it's going to be a lot of work to migrate, so if we do it, we need to come up with a good plan.
@rvolosatovs probably knows more about the custom options that are set in our gogottn generator, but here's what I found for the explicit options in our proto files:
- We could start by removing the
gogoproto.customname,gogoproto.stdtimeandgogoproto.stdduration, andgoproto_enum_prefixoptions. Those are relatively easy to remove, since the Go compiler will immediately complain about any resulting issues. - Removing
gogoproto.embedoption would mean that we can no longer access embedded fields (the Go compiler will help us find those), and that messages no longer satisfy some interfaces (this may be more difficult). - The
gogoproto.nullableoption will be much more work, because we will have to start using the getters, and add nil-checks. Resulting problems may not get caught by the Go compiler. Possible workaround would be to temporarily make those fields private, then rewrite to getters/setters and finally making the fields public again. - What's going to be quite problematic are the fields that use
gogoproto.customtypeand enums that usegogoproto.enum_stringeroptions. For those we've often changed the way they're marshaled/unmarshaled to JSON. For the custombytesfields such as EUI, DevAddr etc. we could change the type (in the proto messages) tostring(which is binary compatible). With the enums I'm afraid it's going to be breaking the JSON API, since those are now accepted (by UnmarshalJSON) as both strings and ints.
Maybe this is also a good time to start thinking about our v4 API, because I can imagine we might discover some more (API breaking) surprises.
I think the best way forward would be to first try out https://github.com/alta/protopatch. Depending on the result:
- If all our needs are covered -> migrate and forget about this(should be just search-and-replace in
apidirectory) - If only some of our needs are covered and there are custom options not covered by the plugin -> we should evaluate on per-option basis and either remove these custom options or, perhaps, contribute to the
protopatchif it's a low-effort feature. This really depends on the option, though - if we're talking aboutcustomtype- that IMO definitely justifies contributing, but maybe something likestdtime- not so much.
Looking forward, I don't think we should be directly using vanilla protobuf protos in components internally at runtime(given the provided feature set of protobuf today). It only makes sense to use protobuf for (de-)serialization, so for storage and on API layer. Internally, however, using plain vanilla generated Go protos makes no sense to me. So, for example, NS:
- get
*ttnpb.EndDevice(vanilla generated Go type) from the registry, deserialized from stored binary data - convert
*ttnpb.EndDeviceintoT_device, (NOTE: perhaps that could be just a wrapper initially or forever) - use
T_deviceinternally in NS - convert
T_deviceinto*ttnpb.EndDevice(NOTE: this could be a trivial, very fast task if we're using a wrapper, since we only need to modify changed fields and that could even be performed on binary data directly) - set
*ttnpb.EndDevice, serialize into binary data
Refs also https://github.com/TheThingsNetwork/lorawan-stack/issues/342 (generated populators)
I'm not in favor of a (smaller) alternative to gogo. It feels like pushing the can. Let's keep things as vanilla as possible, especially when we need to decide again what the best way forward is.
I do agree that we can consider using intermediary types in some places, instead of relying everywhere on the generated protos. It's basically separating data transfer objects (DTOs: protos, also for storage) from data access objects (DAOs: how we use them). If that is primarily reading, we can also declare interfaces and see how far we get with that.
That said, I wouldn't go as far as changing the entire NS to using T_device, but rather specific structs and/or interfaces as needed.
Let's move this discussion to November
@rvolosatovs what is your objection against moving to vanilla with a custom JSON marshaler?
The huge migration burden and loads of boilerplate if we end up just using vanilla protos directly. I don't really object to that though, I just think we should try to find a simple non-intrusive alternative first and if that's not possible, then resort to reworking this whole thing.
I'm afraid that any plugin that we start to rely on will end up in an unmaintained state at some point. Generally speaking, I'm in favor of keeping things as close to vanilla as possible. If that means nil checking more often than we like, then so be it. It can also work in our favor that we know that things are not set, instead of an initialized struct.
I'm afraid that refactoring our entire codebase is going to be a pain no matter how we do it. Our (gogo-generated) proto structs are used everywhere right now (gRPC API, HTTP API, events, errors, internally, Redis DB, ...), so changing to something else (whatever that other thing is) is going to touch pretty much everything in our codebase, and the way it looks now, all at the same time.
The hard requirement is that we don't break compatibility of our v3 API. Even if we decide to use this situation as the moment to start working on a v4 API (at least internally), we will still have to keep supporting that v3 API for existing users.
In the long term, I think we would do ourselves a big favor by decoupling our (versioned, stable-within-major) external APIs from our internal (unversioned, stable-within-minor) API and our (versioned, stable) database documents. We could then write or generate functions to convert between our internal APIs and the others.
But I think there are some steps that we can already take now:
JSON
In order to keep our v3 JSON API compatible, I think our first TODO is to work on generating JSON marshalers and unmarshalers that understand how to marshal/unmarshal our custom types. I think doing this is smart anyway because there is no stability promise for Go's implementation of the JSON format for protocol buffers, so we better have control over that ourselves. Doing this could also allow us to consider field masks when marshaling to JSON. In the grpc-gateway runtime we can register codecs, so we can just write a codec that calls our own (generated) (un)marshalers instead of {gogo,golang}/protobuf's jsonpb.
Generate new protos next to the old ones
I already tried that here: https://github.com/TheThingsNetwork/lorawan-stack/commit/a41f62d98ae7ee719b576e6fcd2009a79cd38f4c
This does make protobuf complain about the types registry, so we may need to remove golang_proto.RegisterType from our old protos to make this work. Removing that could potentially break resolving of google.protobuf.Any, but we only use those in errors and events, so we can probably find workarounds for those specific cases.
Generate functions to convert between old and new protos
This is for the transition period only, but for the long term solution, we'd want to generate similar converters.
Update services one by one
I already tried that with a simple service here: https://github.com/TheThingsNetwork/lorawan-stack/commit/cd7d75c8b42ad15eee1ac594ff6d0f2d5a75eb67, but for more complicated services we'd definitely need those converters.
Note that this only changes the grpc service itself. The grpc-gateway still uses the old gogo stuff on the JSON side, and then calls the internal gRPC server, which then runs the new implementation.
Pushed some initial dependency updates and backwards compatibility work-arounds here: https://github.com/TheThingsNetwork/lorawan-stack/compare/issue/2798-codec
More and more of our dependencies are upgrading to protobuf 1.4 and the V2 API, and the longer we keep this open, the more problems we'll have when trying to upgrade our dependencies.
We should really give this some more priority and make a decision on what we're going to do about all this.
Please plan a call for next week so we can discuss offline.
I think we should go through this pain process and concentrate on solving this in a week or two. And to avoid that we do other things as this is going to cause lots of conflicts otherwise. Having as many hands as possible requires knowing exactly what we are going to do in which cases, dividing tasks as much as possible and keeping eyes on the prize.
Next steps:
- @rvolosatovs updates the tooling to be as close as possible to "vanilla":
- Remove things like
unconvert,gofumptand whatever else we're doing on top of protoc - Switch from
protoc-gen-gogottntoprotoc-gen-gofast(or whatever is closests to vanilla) - Explicitly add
(gogoproto.*)options in our proto files, so that they render the same as now
- We need to refactor our code to use Getters instead of direct field access. Perhaps tools like
goplsandrfcan help with this. - We start removing
(gogoproto.*)options one by one and update code that uses it. Perhaps tools likegoplsandrfcan help with this.
- Removing
gogoproto.populateand updating tests (https://github.com/TheThingsNetwork/lorawan-stack/issues/342) - Removing
gogoproto.customnameand changingEUI -> Euietc. - Removing
gogoproto.embed. We do need to make sure that messages still implement interfaces such asValidateContext(context.Context) errorandExtractRequestFields(m map[string]interface{}). - Removing
gogoproto.nullableand making sure we use Getters where possible, and do nil checks otherwise. - ...
@rvolosatovs let's try to make those first steps for v3.11.3. When that's done please re-add the other assignees, and let's discuss again.
I think it's time for the entire team to start working this, because I really can't (and don't want to) do this alone. It's very time consuming work, and often frustrating, but it needs to be done.
Before my vacation I've spent a lot of time on generating JSON (un)marshalers, which should let us remove the gogoproto.customtypes and hand-written JSON (un)marshalers.
Unfortunately, we still have quite some manual work to do for removing gogoproto.nullable and gogoproto.embed, and I'm afraid this will be a matter of removing hundreds of those options one by one, and making sure that the code that uses those fields doesn't break.
Things to look at:
- Does the message still implement the interfaces that it did? This isn't caught by the compiler. As an example: many request messages currently implement
ttnpb.IDStringer, which is used by rate limiting. Removinggogoproto.embed=truemeans that a request message may no longer implements that interface, which would break rate limiting. There are probably more of such interfaces, how can we catch those? - Do we not introduce nil pointer dereferences? When removing
gogoproto.embed=trueand/orgogoproto.nullable=falseit's possible that some reads/writes to fields are not possible anymore, even though the compiler is perfectly happy. - What else?
I plan to remove some gogoproto.embed and gogoproto.nullable options from user.proto today.
I'll take up gogoproto.embed from deviceclaimingserver.proto today.
In the next pass, I'll take up gateway_services.proto and gateway.proto; the latter being pure drudgery.
I plan to remove some gogoproto.embed and gogoproto.nullable options from organization.proto today.
Will do the same for client.proto.
And today for application.proto.
@KrishnaIyer I'll do the GatewayIdentifiers in gateway.proto, because I've been doing the same for other identifiers in the last couple of days.
So for @adriansmares @bafonins @johanstokking @kschiffer @pgalic96:
- In this round we are removing
gogoproto.embedandgogoproto.nullable - We leave fields with
gogoproto.customtypealone - Search in the repository to see which
.protofiles still have these options - Claim the file by posting a comment here
- Remove the options, fix the rest of the code and submit a PR (see also https://github.com/TheThingsNetwork/lorawan-stack/issues/2798#issuecomment-916737093)
In the next pass, I'll pick up joinserver.proto and metadata.proto.
@KrishnaIyer I'd like to take over joinserver.proto as I'm working on a refactor (https://github.com/TheThingsNetwork/lorawan-stack/issues/4677).
Ok sure 👍
Picking up the applicationserver_*.proto files.
I'm working on the remaining gogoproto.embed options in application.proto, client.proto, gateway.proto, organization.proto and user.proto today.
The following files still have gogoproto.embed options left:
- [x]
api/applicationserver_web.proto - [ ]
api/end_device.proto - [ ]
api/identifiers.proto - [x]
api/identityserver.proto - [ ]
api/join.proto - [ ]
api/lorawan.proto - [x]
api/message_services.proto - [x]
api/messages.proto - [x]
api/metadata.proto - [x]
api/oauth.proto - [x]
api/rights.proto - [x]
api/search_services.proto
And these files (also) have gogoproto.nullable options that need to be removed:
- [ ]
api/application.proto - [x]
api/client.proto - [x]
api/deviceclaimingserver.proto - [ ]
api/end_device.proto - [x]
api/events.proto - [x]
api/gateway.proto - [x]
api/gatewayserver.proto - [ ]
api/identifiers.proto - [ ]
api/identityserver.proto - [ ]
api/join.proto - [x]
api/joinserver.proto - [x]
api/lorawan.proto - [x]
api/message_services.proto - [x]
api/messages.proto - [x]
api/metadata.proto - [ ]
api/oauth.proto - [x]
api/organization.proto - [x]
api/qrcodegenerator.proto - [x]
api/regional.proto - [x]
api/rights.proto - [x]
api/search_services.proto - [x]
api/user.proto
Everyone please claim some more files!
As a reminder:
- In this round we are removing
gogoproto.embedandgogoproto.nullable - We leave fields with
gogoproto.customtypealone - Remove the options, fix the rest of the code and submit a PR (see also https://github.com/TheThingsNetwork/lorawan-stack/issues/2798#issuecomment-916737093)