KLIP-0 Proposal - Adding Protobuf Support
Description
Discussion around KLIP-0 - Adding Protobuf Support
It looks like @Crim hasn't signed our Contributor License Agreement, yet.
The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence. (Wikipedia)
You can read and sign our full Contributor License Agreement here.
Once you've signed reply with [clabot:check] to prove it.
Appreciation of efforts,
clabot
Thanks for creating the first Klip @Crim ! We will have a look and get back to you soon!
Thanks @Crim!
Also, chuckling at starting the KLIPs with a 0. ;-)
See also https://github.com/confluentinc/ksql/pull/1464 (POC implementing protocol buffer message format), also by @Crim.
This is related to: https://github.com/confluentinc/ksql/issues/1057 We have also been asked to add binary JSON support. So rather than implementing this specifically for Protobuf, binary JSON, or some other serialization, it would be better to do this generically via extensions. We could achieve this in a similar way to how UDFs work.
So the suggestion is to provide an interface that anyone can implement that handles deserializing from format X into the appropriate intermediate KSQL format?
we'll define some annotations and/or interfaces that serde classes can use, then load them from a /ext dir on startup
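The annotations/interfaces idea above could look roughly like the sketch below. This is purely illustrative: the names (`KsqlSerdeProvider`, `FormatRegistry`) and the shape of the intermediate row type are invented here, not real KSQL APIs; only the `ServiceLoader`-based discovery is standard Java.

```java
// Hypothetical sketch of a pluggable serde SPI that KSQL could discover
// from an /ext directory at startup, analogous to how UDFs are loaded.
// KsqlSerdeProvider and FormatRegistry are invented names, not KSQL APIs.
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

interface KsqlSerdeProvider {
    String formatName();                       // e.g. "PROTOBUF", matched against WITH (value_format=...)
    byte[] serialize(Map<String, Object> row); // KSQL intermediate row -> wire bytes
    Map<String, Object> deserialize(byte[] data); // wire bytes -> KSQL intermediate row
}

final class FormatRegistry {
    private static final Map<String, KsqlSerdeProvider> PROVIDERS = new HashMap<>();

    // At startup, providers placed on the extension classpath would be
    // discovered via the standard java.util.ServiceLoader mechanism.
    static void loadFromClasspath() {
        for (KsqlSerdeProvider p : ServiceLoader.load(KsqlSerdeProvider.class)) {
            register(p);
        }
    }

    static void register(KsqlSerdeProvider provider) {
        PROVIDERS.put(provider.formatName().toUpperCase(), provider);
    }

    static KsqlSerdeProvider lookup(String format) {
        return PROVIDERS.get(format.toUpperCase());
    }
}
```

A jar dropped into /ext would then only need a `META-INF/services/KsqlSerdeProvider` entry for its implementation to be picked up without any KSQL code changes.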
@Crim it's gonna be a great feature for KSQL :) thanks !
Do you know when it could be released?
Any update on this? We might be interested in helping if necessary.
If I recall correctly, my linked PR covered about 85% of the work needed for the KLIP as originally proposed, with a few edge cases left to tidy up, plus appropriate test cases, documentation, etc.
From the comments on the KLIP, it sounds like there is another direction the project owners would like to go with it, but unfortunately it's unclear what that direction is...so it's difficult to contribute in a meaningful way.
Any updates here, @Crim or @abergmeier ?
I have taken @Crim's changes and run with them a good deal further in this PR.
These changes add full support for nested protobufs, auto-schema generation (I tried to copy the existing approach for adding fields from Avro), and CSAS support via WITH (value_format='PROTOBUF').
Regarding the UX, I believe I could fairly easily add a classpath scanner that looks for implementations of the Message interface, which could be hooked up to some 'SHOW PROTOBUFS;' command. There should be a reasonably easy way of adding jars to the classpath for a dockerized KSQL; currently I'm just building the container with my additional jars, but that's not a great solution. Having some env var besides KSQL_CLASSPATH that could optionally be appended to KSQL_CLASSPATH could be an answer; then one could volume-mount in a folder of jars.
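The 'SHOW PROTOBUFS;' idea could sit on top of something like the registry sketched below. This is a hedged illustration, not the PR's implementation: the class name is invented, and the actual scanner that finds com.google.protobuf.Message implementations on the extension classpath would need a library such as Reflections or ClassGraph, so here the registry is shown alone with class names as plain strings to keep the sketch dependency-free.

```java
// Hypothetical backing store for a 'SHOW PROTOBUFS;' command: a registry
// that a classpath scanner would populate at startup with the fully
// qualified names of generated protobuf Message classes found in /ext jars.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ProtobufTypeRegistry {
    private static final List<String> MESSAGE_CLASSES = new ArrayList<>();

    // Called by the (not shown) classpath scanner for each Message subtype found.
    static void register(String fullyQualifiedClassName) {
        MESSAGE_CLASSES.add(fullyQualifiedClassName);
    }

    // 'SHOW PROTOBUFS;' would render this sorted, read-only list.
    static List<String> listTypes() {
        List<String> sorted = new ArrayList<>(MESSAGE_CLASSES);
        Collections.sort(sorted);
        return Collections.unmodifiableList(sorted);
    }
}
```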
If there's a specific way you guys would like to see these changes implemented I probably wouldn't mind making them.
I do have a question regarding creating a table from a protobuf topic: ROWKEY/ROWTIME are null. I've tried to hunt it down but haven't been able to see how that info is carried through for the other value formats.
Any input on my changes would be greatly appreciated or some direction on where to go next.
We had a productive meeting on Friday about this with some members of Confluent, @arbiter34, @bsvingen, and a few other folks. Here are the notable discussion points:
- For most use cases, it's important to retain the formatting of messages in Protobuf between sources and sinks, given that KSQL is often being used for streaming-ETL. Consumers shouldn't need to differentiate the format depending on whether a KSQL query has been applied to the data.
- It might be okay for some situations to simply be able to read sources formatted in protobuf, and convert them to something like JSON on the way out, but it's definitely less applicable.
- There are plenty of situations where there is only one, or a small number, of protobuf schemas for the entire system. In these settings, even if Schema Registry supported protobuf, it can be desirable to statically load the schema files into KSQL and avoid running SR altogether.
For next steps, we'd like to do a little more verification that it'd be generally useful to support protobuf unattached from Schema Registry.
Thanks for the meeting notes, @MichaelDrogalis.
Couple of questions:
- What was the feedback regarding schema evolution?
For most use cases, it's important to retain the formatting of messages in Protobuf between sources and sinks, given that KSQL is often being used for streaming-ETL. Consumers shouldn't need to differentiate the format depending on whether a KSQL query has been applied to the data.
- Can you elaborate on the "to retain the formatting of messages in Protobuf between sources and sinks"? FWIW, we have a related ticket "Support a VARBINARY, BINARY, or BYTES data type" (#1742), where a key motivation is streaming ETL, with the idea to support passing data through KSQL as-is, i.e. without any modification.
There are plenty of situations where there is only one, or a small number, of protobuf schemas for the entire system. In these settings, even if Schema Registry supported protobuf, it can be desirable to statically load the schema files into KSQL and avoid running SR altogether.
- The same argument can be made for Avro. But we deliberately decided to require SR for Avro support, and to not allow things like statically loading schema files into KSQL (for Avro). Why would the situation be different for Protobuf?
- What was the feedback regarding schema evolution?
Not sure which part of my notes this is a question towards, but pretty much anyone using protobuf is going to want backward compatibility. @bsvingen mentioned that their system can run for a while without upgrading when a new evolution of the schema is available.
- Can you elaborate on the "to retain the formatting of messages in Protobuf between sources and sinks"? FWIW, we have a related ticket "Support a VARBINARY, BINARY, or BYTES data type" (#1742), where a key motivation is streaming ETL, with the idea to support passing data through KSQL as-is, i.e. without any modification.
This is a bit simpler. I was just saying that anyone using protobuf as the format on the way in probably wants protobuf as the format on the way out too. That is, not supporting a protobuf output path undercuts a lot of the value.
- The same argument can be made for Avro. But we deliberately decided to require SR for Avro support, and to not allow things like statically loading schema files into KSQL (for Avro). Why would the situation be different for Protobuf?
Could be worth revisiting that decision. Just a few data points, but when the number of schemas is small (for instance, only 1 in @bsvingen's case), the complexity of running SR can be greater than maintaining the schemas manually and loading them in. But again, just a data point.
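The "statically load schema files, skip SR" option could be wired up along the lines of the sketch below. Everything here is an assumption for illustration: the config property format and class name are invented, and the comments describe (without depending on) the protobuf-java machinery that would consume the loaded bytes.

```java
// Hypothetical sketch of statically loading protobuf schemas instead of
// running Schema Registry. A config property (format invented here) maps
// topics to compiled descriptor files produced by
// `protoc --descriptor_set_out`. The bytes read from each file would then
// be fed to protobuf-java's FileDescriptorSet/DynamicMessage machinery to
// decode records with no Schema Registry in the picture.
import java.util.LinkedHashMap;
import java.util.Map;

final class StaticSchemaConfig {
    // Parses e.g. "orders=/etc/ksql/schemas/orders.desc,users=/etc/ksql/schemas/users.desc"
    static Map<String, String> parse(String property) {
        Map<String, String> topicToDescriptorFile = new LinkedHashMap<>();
        for (String entry : property.split(",")) {
            String[] parts = entry.split("=", 2);
            topicToDescriptorFile.put(parts[0].trim(), parts[1].trim());
        }
        return topicToDescriptorFile;
    }
}
```

When the schema count is one or a handful, maintaining such a mapping by hand is plausibly less operational burden than running SR, which is the trade-off discussed above.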
@MichaelDrogalis Could you share any general timeline as to when we can expect Protobuf support? The feature request is pretty old by now. My organisation is considering KSQL, but Protobuf support is a hard requirement here. Are you aware of any sensible workarounds in the meantime?
Hi @vyrwu. We aren't working on this one right now, and unfortunately, the proposed implementation isn't general enough to merge.
Where we're stuck is that we need dynamic protobuf schema support for internal topics. This would get used anytime there is an aggregate or really anything that changes the input schema. This would work if users were ok with the final format being Avro or JSON because it would avoid those issues, but it doesn't seem all that acceptable to me.
Hi, any news about this?
@jsolana Nothing to share yet. We haven't made progress on the above challenge.
Hey, any news about this? Protobuf is a must for our KSQL usage, but we don't care which format the internal topics use. Just thumbing this up :)... :+1:
@marcosArruda Coming soon! The commits we need just landed in Schema Registry a bit ago. :) https://github.com/confluentinc/schema-registry/pull/1285
The work on our end is kicking off now.