vector
vector copied to clipboard
Add `protobuf` codec
A user wants to deserialize Kafka messages as protobuf with a provided type definition.
It looks like https://docs.rs/protofish/0.4.0/protofish/ might support this.
cc/ @pablosichert I wonder how hard you think this might be to add?
If you guys are into this, let's not forget about Avro please! 🤞 Btw, Schema Registry support is umbrella to Protobuf and Avro, as the SR can have JSON, ProtoBuf and Avro schemas. Therefore, having all these codecs first and then later SR support for all these is a very good strategy.
Strong agree. I think we can support AVRO and Protobuf with runtime definitions first and add support for schema registry later.
https://github.com/apache/avro/tree/master/lang/rust seems like it supports runtime parsing of AVRO.
Agree, these will be very useful additions 👍
:+1:
Hi,
I would like to implement this feature.
I tried to play around with it and did a little rust program that reads a desc file (binary of FileDescriptorSet, just like kafdrop)
which will later be passed in vector config file.
I thought about using the desc file since it contains all the needed info to parse the message.
The parsing itself uses protobuf
& serde_protobuf
& serde_value
(sadly seems serde_protobuf
doesn't support working with prost
)
in the config file, the user will provide the following:
- path to the desc file
- the type of the protobuf message to parse from
I am basing
ProtobufDeserializer
onJsonDeserializer
as reference, please let me know what you think about this approach? Thanks.
@Daniel599 Can you actually provide us a link to any of this code? Hard to evaluate without seeing a proof of concept.
while making the requested demo, I found out the crate serde_protobuf uses old protobuf crate version, so I dug around some more and there are two options:
- protobuf crate version 3: has dynamic parsing
- prost & prost-reflect
I saw vector uses prost, so I chose the later, also the API of prost-reflect seems easier.
@tobz you can see a demo in the following repo: https://github.com/Daniel599/protobuf_desc_demo
notes:
- The build.rs part won't be added into vector.
- The path of test_protobuf.desc & "test_protobuf.Person" will be configurable in the final version
P.S it's only the 4th time I actually write something meaningful in Rust, so any advice on improvements will be great.
@Daniel599 Looking at your example, I think this approach would be acceptable to us.
Just talking out loud, to write down some thoughts for when we review this work later:
- This demo will grab all fields, even if they weren't actually set in the message. We'll likely want to make sure we don't do that.
- Similarly, the demo currently emits enum fields as their numerical (on-the-wire) value. We'd likely want to use the variant name when transforming to
vrl::Value
. - The performance doesn't look terrible. Between getting a
serde_json::Value
fromserde_json::from_str
and theDynamicMessage
+to_vrl(DynamicMessage) -> vrl::Value
approaches, they're generally both fairly close: both are in the 4-5µs per call range with some simple timing added to the above demo. - One annoying thing is that
prost_reflect
has no way to consume owned versions of the fields, which would be useful to avoid a ton of microallocations, but we can think about that when there's an actual PR to review.
@tobz I've initiated a pull request for this feature, but I still have some TODO's left for which I would appreciate your feedback.
seems like I had to write "fixes:" in my PR instead of "addresses:". I think this issue can be closed now, if not, let me know what is missing.
Agreed, thanks @Daniel599! Closed by https://github.com/vectordotdev/vector/pull/18019