
Create a decoder processor to decode Event keys

Open · graytaylor0 opened this issue Dec 11 '23 · 6 comments

Is your feature request related to a problem? Please describe.
As a user of Data Prepper, my Events contain keys whose values are encoded in different formats, such as gzip, base64, and protobuf (https://protobuf.dev/programming-guides/encoding/).

Sample Event

{
  "my_protobuf_key": "",
  "my_gzip_key": "H4sIAAAAAAAAA/NIzcnJVyjPL8pJAQBSntaLCwAAAA==",
  "my_base64_key": "SGVsbG8gd29ybGQ="
}

Describe the solution you'd like
A new processor called a decoder processor that can decode various encodings. The following configuration example would decode the three values in the example Event above.

processor:
  - decoder:
      key: "my_base64_key"
      # Can be one of gzip, base64, or protobuf
      base64:
  - decoder:
      key: "my_gzip_key"
      gzip:
  - decoder:
      key: "my_protobuf_key"
      protobuf:
        message_definition_file: "/path/to/proto_definition.proto"
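
For illustration, here is a minimal Java sketch of the decoding work this configuration implies for the base64 and gzip cases, using only JDK classes. The SimpleEvent interface is a hypothetical stand-in rather than the actual Data Prepper plugin API, and protobuf is left out because it needs a message definition (discussed below).

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

// Hypothetical stand-in for an Event holding string keys and values.
interface SimpleEvent {
    String get(String key);
    void put(String key, String value);
}

class DecoderSketch {
    // Replace a base64-encoded value with its decoded UTF-8 text.
    static void decodeBase64(SimpleEvent event, String key) {
        byte[] decoded = Base64.getDecoder().decode(event.get(key));
        event.put(key, new String(decoded, StandardCharsets.UTF_8));
    }

    // Gzip output is binary, so the sample Event carries it base64-wrapped;
    // strip the wrapper first, then decompress.
    static void decodeGzip(SimpleEvent event, String key) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(event.get(key));
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            event.put(key, new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}

Against the sample Event above, my_base64_key decodes to "Hello world"; the gzip value goes through the same base64 step before decompression.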

Tasks

  • [x] #4016
  • [ ] OTel decode processor
  • [ ] Decode encoded data (base64)

graytaylor0 · Dec 11 '23

It may be advantageous to have different processors for different encodings, for a few reasons.

  1. Each of these brings in different dependencies and long-term we may want to make some of these plugins optional to keep the overall size of the project down.
  2. This could produce simpler YAML, as we won't need the nested group for custom configurations.
  3. It remains consistent with other processors: we can already decode/parse JSON, CSV, and now ION, and each has its own processor. For example:

processor:
  - decode_protobuf:
      key: my_protobuf_key
      message_definition_file: "/path/to/proto_definition.proto"
  - decode_base64:
      key: my_base64_key

Compression might be a special case. Maybe we'd have a single processor for that. Though it wouldn't help with overall dependency reduction.

processor:
  - decompress:
      key: my_gzip_key
      type: gzip
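
As a rough sketch of the type dispatch inside such a decompress processor (again JDK-only; the "deflate" branch is purely illustrative of how a second type could slot in, not part of this proposal):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

class DecompressSketch {
    // Wrap the raw bytes in the stream that matches the configured type.
    static InputStream decompressorFor(String type, byte[] raw) throws IOException {
        ByteArrayInputStream source = new ByteArrayInputStream(raw);
        switch (type) {
            case "gzip":
                return new GZIPInputStream(source);
            case "deflate":
                return new InflaterInputStream(source);
            default:
                throw new IllegalArgumentException("Unsupported compression type: " + type);
        }
    }
}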

dlvenable · Dec 11 '23

I think we need to support "/path/to/proto_definition.proto" being an S3 path as well, right?

kkondaka · Dec 11 '23

@kkondaka,

I think we need to support "/path/to/proto_definition.proto" being an S3 path as well, right?

That is probably ideal, though it could also come as a follow-on based on feedback. Also, I think we need a more general structure for getting data from S3, a file path, etc. The current approach is rather cumbersome for both users and developers. We could do something similar to what we did with AWS Secrets and hopefully will do with environment variables.
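
As a sketch of that more general structure, the configured path could be resolved as either a local file or an s3:// URI before the processor reads it. The helper below is hypothetical; the only real API used is the AWS SDK v2 S3Client.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

class DefinitionSource {
    // Hypothetical helper: open the configured path whether it points to a
    // local file or an S3 object, e.g. "s3://my-bucket/protos/definition.desc".
    static InputStream open(String path, S3Client s3) throws IOException {
        if (path.startsWith("s3://")) {
            URI uri = URI.create(path);
            GetObjectRequest request = GetObjectRequest.builder()
                    .bucket(uri.getHost())
                    .key(uri.getPath().substring(1)) // drop the leading "/"
                    .build();
            return s3.getObject(request);
        }
        return new FileInputStream(path);
    }
}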

dlvenable · Dec 12 '23

For protobuf decoding, what's the expected format of the file /path/to/proto_definition.proto? Is it supposed to contain the definition of the messages, something like this?

syntax = "proto2";

message ProtoBufMessage {
  // Define your message fields here
  // Example:
  required int32 intField = 1;
  required string strField = 2;
}

I think it would be difficult to support such cases because such files need to be compiled.

It looks like if the above file is compiled and a descriptor file is created using the following command:

protoc --descriptor_set_out=MyMessage.desc MyMessage.proto

then using the file MyMessage.desc as the path in the Data Prepper message_definition_file configuration could work.
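
For what it's worth, a sketch of consuming such a compiled descriptor set with protobuf-java could look like the following. It assumes the message type lives in the first (and only) file of the set and has no imports; handling imports would mean running protoc with --include_imports and building each file's dependencies first. The class and method names here are illustrative.

import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;
import com.google.protobuf.DynamicMessage;
import java.io.FileInputStream;
import java.io.IOException;

class ProtobufDecodeSketch {
    // Parse raw protobuf bytes into a DynamicMessage using a compiled
    // descriptor set (the output of protoc --descriptor_set_out=...).
    static DynamicMessage decode(String descriptorPath, String messageName, byte[] payload)
            throws IOException, Descriptors.DescriptorValidationException {
        DescriptorProtos.FileDescriptorSet set;
        try (FileInputStream in = new FileInputStream(descriptorPath)) {
            set = DescriptorProtos.FileDescriptorSet.parseFrom(in);
        }
        // Simplification: assumes a single self-contained .proto file; with
        // imports, dependencies must be built and passed in here instead of
        // the empty array.
        Descriptors.FileDescriptor file = Descriptors.FileDescriptor.buildFrom(
                set.getFile(0), new Descriptors.FileDescriptor[0]);
        Descriptors.Descriptor type = file.findMessageTypeByName(messageName);
        return DynamicMessage.parseFrom(type, payload);
    }
}

From there, something like JsonFormat from protobuf-java-util could turn the DynamicMessage into JSON before it is put back into the Event, though that is a separate decision.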

kkondaka · Dec 13 '23

@kkondaka, what exactly is the descriptor in this proposal? Is it the "File descriptor" JSON in the following documentation?

https://protobuf.com/docs/descriptors#message-descriptors

dlvenable · Dec 15 '23

We should also consider how to handle Protobuf imports.

dlvenable · Dec 15 '23