confluent-kafka-python
confluent-kafka-python copied to clipboard
Encode message indexes with varints to match java impl
Fixes #926
Otherwise the python and java implementations are mutually incompatible.
Is there anything else that needs to be done about this w.r.t. backwards compatibility? This version will be incompatible with previous versions (w.r.t. reading/writing data) in the same way that the previous versions are already incompatible with the java library.
That seems like a poor user experience (it's a "breaking change", in a sense), but I'm not sure what a reasonable mitigation is, and doing nothing (retaining incompatibility with java) is worse.
I should also mention that I was unable to run tests locally, so I'll be relying on the build to tell me if I overlooked something.
@confluentinc It looks like @gfredericks just signed our Contributor License Agreement. :+1:
Always at your service,
clabot
Thank you for this!
We'll need to think out a compatibility strategy before merging it.
I suppose we probably want to support the old and new behavior? Is it sufficient to just do that, configurably? If so,
A) what's the best configuration mechanism? B) which behavior should be the default? C) where does the configuration need to be documented?
@gfredericks , the dotnet client has the same issue. I suggest we use another config use.deprecated.format (that defaults to false) to enable the old behavior. We will do the same in the dotnet client. WDYT?
Sounds fine to me
this just amounts to adding that to the docstring and default conf for ProtobufSerializer, and then adding a new analogous conf=None arg to `ProtobufDeserializer? It doesn't look like it takes a config currently
where else does it need to be documented besides the docstrings of those classes?
@gfredericks , yes, sounds good. The use.deprecated.format (with default false) should be used in both the serializer and deserializer. I think the docstrings are sufficient. Here is the text I am using in the dotnet client: Specifies whether the Protobuf serializer should serialize message indexes without zig-zag encoding.
This doesn't affect me, but I just wanted to point out that any python-only or dotnet-only user that would naturally opt-in to this and plan to use it indefinitely could get themselves into a pretty gnarly situation if they blindly upgrade and start running this in production, since they will end up with published data in two formats and no deserializer that can read both.
(this is of course avoidable by reading the release notes carefully and setting the new config option, but I'm imagining many users won't do that)
Does that change the analysis at all?
If we want to take that concern seriously, I can think of two options
A) broadcast the change very loudly, somehow, so that users are more likely to be aware of it (rather than just have it in the docstring)
B) start by defaulting to true, and issue a warning (explaining the severity of the situation, that the default will change in the future and they could end up with incompatible data) if the user doesn't pass the config option; then in some future version switch the default to false (possibly even requiring the config option to be passed, though that's a bit extreme)
@gfredericks , since out of the box this client should interoperate with the Java client (including Connect and ksqlDB), we'll have to do a good job communicating the change, such as in the release notes as you suggested.
BTW, the issue only arises when trying to serialize/deserialize a message type that is not the first message in the Protobuf schema.
Okay; I'll add a commit soon that supports both behaviors and the config options as described above
@gfredericks , have you had a chance to look at adding the use.deprecated.format flag?
I haven't -- I could probably prioritize it for this week if that would be particularly strategic on your end.
@gfredericks , if you could prioritize this week, that would be awesome :)
I added a commit for backwards compatibility.
It looks like the travis build isn't wired up anymore?
I assume due to
Since June 15th, 2021, the building on travis-ci.org is ceased. Please use travis-ci.com from now on.
The upshot is that I don't know if the tests pass.
... seems like a poor user experience (it's a "breaking change", in a sense), but I'm not sure what a reasonable mitigation is, and doing nothing (retaining incompatibility with java) is worse.
I think the only strictly correct (and good) thing to do is version bump the non-java clients to 2.0. The timing is actually probably pretty good for this as it's been a long time since a major version bump and there are a bunch of other breaking changes we could do simultaneously.
The second best thing I think is to merge this despite the breaking change, as per the rationale above and also my rational when merging the corresponding change in .NET (note: not in a release yet, could be backed out). It is a bit questionable though.
Based on the above, I think we should merge now either way, but I also want to have a broader discussion that includes @edenhill before doing that.
isn't the first byte of the wire protocol intended to be a version identifier for the wire protocol? If so, wouldn't the logical thing to do be to increment the version of the wire protocol for the new serializer? That way, a deserializer will know how to do both, automatically, based on the first byte of the message. Correctly functioning deserializers in other languages will break when reading version 0 messages that were encoded by those previously broken serializers, but that has always been the case, apparently, so isn't a breaking change. And it might be possible to engineer those deserializers to detect a faulty version 0 and try to use a matching faulty deserializer if correct deserialization fails.
Frankly, I would deprecate the heck out of the use of version 0 by ANY serializer when this PR ships because you do not want incompatible wire protocols out there. Version 0 is already broken since there are two versions of it living out in the wild, possibly in the same cluster for some customers. If some deserializers are still expecting version 0 for all messages and some serializers are using version 1 (or a broken version 0), that's a trainwreck waiting to happen. I think confluent really MUST update the wire protocol version to handle this.
In fact, if you bump the wire protocol version for all serializers, then all you need in all future deserializers is a configuration which states whether to treat version 0 as java version 0 or python/.NET version 0. So long as a developer configures a deserializer instance with the correct version 0, everything should continue to 'just work'
It'd be even better to ship a tool that can convert an existing topic or write a corrected version to a new topic, so that upgraders can generate a topic with a consistently unserializable stream of data. It definitely wouldn't be hard to implement, depending upon how easy it is to decode the broken encoding in java.
I think the only strictly correct (and good) thing to do is version bump the non-java clients to 2.0. The timing is actually probably pretty good for this as it's been a long time since a major version bump and there are a bunch of other breaking changes we could do simultaneously.
Adding to my comment immediately above this -
Bumping the version of the serializers and deserializers isn't strictly necessary. Bumping the version of the wire protocol and then modifying the existing version of the serializers and deserializers to handle version 1 of the wire protocol is. You could certainly justify a major version bump for such a change but I think you could also justify calling it minor since you can definitely avoid a breaking change by just setting the default config for serializers to be to use version 0 of the wire protocol for whatever language is being modified and the default config for deserializers to interpret version 0 according to the previous implementation for the language in question. Then, when there is a major version bump, switch the default config to use version 1 of the wire protocol, which will be correct in all languages and deprecate the use of version 0 anywhere.
And for those starting to use schema-registry after the minor version bump, make sure the documentation STRONGLY recommends the use of the config which forces version 1 of the wire protocol and which tells the deserializer how to interpret version 0 of the wire protocol for a given topic.
@ideasculptor , a couple of comments:
- Records created by the Python serializer are fine if they encoded the first message of the Protobuf schema (because the encoding for 0 is the same whether using zig-zag encoding or not). It is only the records encoded with any other message than the first message of the schema that are problematic. The first message is by far the more common use case. (That's probably why this bug went unnoticed for so long.)
- One of the purposes of this PR is so that the Python deserializer can read records serialized by a Java client that have already encoded messages other than the first message of the Protobuf schema. Bumping the magic byte number won't help that.
Given that there are probably not many usages of encoding some message other than the first message of the schema by the Python serializer, it does not seem worth the effort to bump the magic byte number due to this bug.
I think you are dramatically underappreciating the severity of the breakage and making not necessarily valid assumptions about the likeilhood of messages without a 0 index if you ship a serializer that changes the format without bumping the protocol version. You cannot do that without changing the version of the protocol since it will preclude anyone with existing records in their topics from ever updating to the latest serializer without losing the ability to decode old messages. This is not really an optional thing - confluent pushes their platform as an event log, and anyone using it as such is likely to have actual regulatory requirements about being able to access old data and not allowing data to be rewritten/modified. Event logs often need to be immutable once written. If you break the python deserializer and/or fail to provide a mechanism for other languages to read records written by the old serializer after an update to the new one, you will force customers into regulatory non-compliance.
It makes FAR more sense to assume that users only produce messages to a given topic from one client language than it does to just assume that no one has messages that aren't at index 0 (I certainly have messages that aren't at index 0, and I have messages that are at index 0 that are encoded with 1,0 as the message index array because that shortcut of writing just 0 added unnecessary complexity to my serialization code. Removing it eliminates a special case conditional from the serialization code. The special case only works without an extra conditional when deserializing. You have to specifically include to in any serialization code, so I didn't bother, since writing 1,0 is fully correct and only costs an extra byte.
If you assume that users know what language they used to encode messages with protocol version 0, then they can configure an updated deserializer to treat protocol 0 the correct way for THEIR topics. If you leave the protocol version as 0, you screw everyone, since there is no way for any deserializer to programmatically determine how a given message was encoded, and any topic that has both encodings will now have inaccessible messages for any given deserializer. There will be no way to handle it without writing code to treat messages with a certain offset a different way - and that's assuming a naive user who updates their python package knows what the offsets were at the changeover point. And it is a whole lot more complicated to implement than simply parsing the message header conditionally based on the first byte received.
The protocol version exists for a reason. Why would you resist using it when you've got a distinct incompatibility between versions? Not making a breaking change without changing a version number is basically the first rule of reliable software development. There are going to turn out to be lots of python users who haven't realized the incompatibility exists yet who are going to find out about it only after you ship a version with an incompatible change and make note of it in the release notes, and then it will be too late for any of them to object to not bumping the version because confluent will have shipped two versions of the serializer each using protocol 0.
Frankly, if the problem is that you only allowed yourself 255 potential versions and you are reluctant to use one up, then you should also use the protocol bump to go to 2 bytes for the protocol version so that you don't have to burn yet another if and when you finally decide that is necessary. If the limited version set is already causing a problem when you have an obviously breaking change happening, you've definitely made your set too small. The sooner you fix it, the better. If you fix it now, then ANY non-0 value in the first byte can be assumed to mean that the protocol version will be carried in 2 bytes instead of 1. If you wait, then you have to treat 0 and 1 as single byte versions and so on. It only makes your problem more complicated to solve the longer you wait. Either assume that 255 versions is likely to carry you for long enough even if you burn 1 to fix the python serializer or else fix the version size problem as well.
@ideasculptor , a couple of comments:
- One of the purposes of this PR is so that the Python deserializer can read records serialized by a Java client that have already encoded messages other than the first message of the Protobuf schema. Bumping the magic byte number won't help that.
It most certainly DOES help that. It allows a deserializer to decode protocol 0 correctly with only the knowledge of which language was used to encode it. If you don't bump the version, then knowledge of the language used to encode a message is useless for helping to decode it, since it could still go either way. You don't need to just decode java serialized messages in python. You also need to decode python serialized messages in java. Just because that has been broken all along except for the special case of the first message in the schema doesn't mean that there aren't messages out there that may encounter the problem in future. All it takes is someone wanting to use a new language on an existing topic to suddenly find out that they cannot. It's particularly problematic for someone wanting to use the rest of the confluent platform, since that is all implemented in java and java will remain unable to decode messages serialized by python except in the special case.
By bumping the protocol version and providing a configurable mechanism for handling version 0 in ALL future clients, including the rest of the confluent platform, you give anyone who encounters the problem in the future a solution. Without it, you are hanging them out to dry.
@edenhill you redid this on master, is that right? https://github.com/confluentinc/confluent-kafka-python/commit/2ac0d72b24b14e5246445ad9ce66ec9c8828ef4e
Should we close this PR then?
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.