vector icon indicating copy to clipboard operation
vector copied to clipboard

Add `protobuf` codec

Open jszwedko opened this issue 3 years ago • 4 comments

A user wants to deserialize Kafka messages as protobuf with a provided type definition.

It looks like https://docs.rs/protofish/0.4.0/protofish/ might support this.

cc/ @pablosichert I wonder how hard you think this might be to add?

jszwedko avatar Oct 21 '21 16:10 jszwedko

If you guys are into this, let's not forget about Avro please! 🤞 Btw, Schema Registry support is umbrella to Protobuf and Avro, as the SR can have JSON, ProtoBuf and Avro schemas. Therefore, having all these codecs first and then later SR support for all these is a very good strategy.

hhromic avatar Oct 21 '21 16:10 hhromic

Strong agree. I think we can support AVRO and Protobuf with runtime definitions first and add support for schema registry later.

https://github.com/apache/avro/tree/master/lang/rust seems like it supports runtime parsing of AVRO.

jszwedko avatar Oct 21 '21 17:10 jszwedko

Agree, these will be very useful additions 👍

pablosichert avatar Oct 26 '21 13:10 pablosichert

:+1:

franklymrshankley avatar Jun 26 '22 07:06 franklymrshankley

Hi, I would like to implement this feature. I tried to play around with it and did a little rust program that reads a desc file (binary of FileDescriptorSet, just like kafdrop) which will later be passed in vector config file. I thought about using the desc file since it contains all the needed info to parse the message. The parsing itself uses protobuf & serde_protobuf & serde_value (sadly seems serde_protobuf doesn't support working with prost)

in the config file, the user will provide the following:

  • path to the desc file
  • the type of the protobuf message to parse from I am basing ProtobufDeserializer on JsonDeserializer as reference, please let me know what you think about this approach? Thanks.

Daniel599 avatar May 28 '23 22:05 Daniel599

@Daniel599 Can you actually provide us a link to any of this code? Hard to evaluate without seeing a proof of concept.

tobz avatar Jun 01 '23 20:06 tobz

while making the requested demo, I found out the crate serde_protobuf uses old protobuf crate version, so I dug around some more and there are two options:

  • protobuf crate version 3: has dynamic parsing
  • prost & prost-reflect

I saw vector uses prost, so I chose the later, also the API of prost-reflect seems easier.

@tobz you can see a demo in the following repo: https://github.com/Daniel599/protobuf_desc_demo

notes:

  • The build.rs part won't be added into vector.
  • The path of test_protobuf.desc & "test_protobuf.Person" will be configurable in the final version

P.S it's only the 4th time I actually write something meaningful in Rust, so any advice on improvements will be great.

Daniel599 avatar Jun 07 '23 22:06 Daniel599

@Daniel599 Looking at your example, I think this approach would be acceptable to us.

Just talking out loud, to write down some thoughts for when we review this work later:

  • This demo will grab all fields, even if they weren't actually set in the message. We'll likely want to make sure we don't do that.
  • Similarly, the demo currently emits enum fields as their numerical (on-the-wire) value. We'd likely want to use the variant name when transforming to vrl::Value.
  • The performance doesn't look terrible. Between getting a serde_json::Value from serde_json::from_str and the DynamicMessage + to_vrl(DynamicMessage) -> vrl::Value approaches, they're generally both fairly close: both are in the 4-5µs per call range with some simple timing added to the above demo.
  • One annoying thing is that prost_reflect has no way to consume owned versions of the fields, which would be useful to avoid a ton of microallocations, but we can think about that when there's an actual PR to review.

tobz avatar Jun 14 '23 12:06 tobz

@tobz I've initiated a pull request for this feature, but I still have some TODO's left for which I would appreciate your feedback.

Daniel599 avatar Jul 20 '23 14:07 Daniel599

seems like I had to write "fixes:" in my PR instead of "addresses:". I think this issue can be closed now, if not, let me know what is missing.

Daniel599 avatar Aug 04 '23 19:08 Daniel599

Agreed, thanks @Daniel599! Closed by https://github.com/vectordotdev/vector/pull/18019

jszwedko avatar Aug 04 '23 19:08 jszwedko