Create a decoder processor to decode Event keys
**Is your feature request related to a problem? Please describe.**
As a user of Data Prepper, my Events contain keys whose values are encoded in different formats, such as gzip, base64, and protobuf (https://protobuf.dev/programming-guides/encoding/).
Sample Event:

```json
{
  "my_protobuf_key": "",
  "my_gzip_key": "H4sIAAAAAAAAA/NIzcnJVyjPL8pJAQBSntaLCwAAAA==",
  "my_base64_key": "SGVsbG8gd29ybGQ="
}
```
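For reference, here is a minimal Java sketch of what decoding the base64 and gzip values above involves, using only the standard library. It assumes the gzip bytes are base64-wrapped for transport in the JSON string, since gzip output is binary:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

public class DecodeExamples {
    public static void main(String[] args) throws Exception {
        // base64: a single standard-library decode
        String base64Value = "SGVsbG8gd29ybGQ=";
        System.out.println(new String(Base64.getDecoder().decode(base64Value), StandardCharsets.UTF_8));

        // gzip: strip the base64 wrapper first, then decompress the stream
        String gzipValue = "H4sIAAAAAAAAA/NIzcnJVyjPL8pJAQBSntaLCwAAAA==";
        byte[] compressed = Base64.getDecoder().decode(gzipValue);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```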
**Describe the solution you'd like**
A new processor called a `decoder` processor that can decode various encodings. The following configuration example would decode the three values in the example Event above:
```yaml
processor:
  - decoder:
      key: "my_base64_key"
      # Can be one of gzip, base64, or protobuf
      base64:
  - decoder:
      key: "my_gzip_key"
      gzip:
  - decoder:
      key: "my_protobuf_key"
      protobuf:
        message_definition_file: "/path/to/proto_definition.proto"
```
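For a sense of the per-event work such a processor would do, here is a minimal sketch of the base64 case against Data Prepper's `Event`/`Record` model. The class itself and its wiring into the plugin framework (configuration class, annotations) are hypothetical and omitted:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Collection;

import org.opensearch.dataprepper.model.event.Event;
import org.opensearch.dataprepper.model.record.Record;

// Illustrative only; not an actual Data Prepper plugin.
public class DecoderProcessorSketch {
    private final String key;

    public DecoderProcessorSketch(final String key) {
        this.key = key;
    }

    // Replace the configured key's value with its base64-decoded form.
    public Collection<Record<Event>> execute(final Collection<Record<Event>> records) {
        for (final Record<Event> record : records) {
            final Event event = record.getData();
            final String encoded = event.get(key, String.class);
            if (encoded != null) {
                event.put(key, new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8));
            }
        }
        return records;
    }
}
```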
Tasks
- [x] #4016
- [ ] OTel decode processor
- [ ] Decode encoded data (base64)
It may be advantageous to have different processors for different encodings, for a few reasons.
- Each of these brings in different dependencies, and long-term we may want to make some of these plugins optional to keep the overall size of the project down.
- This could produce simpler YAML, as we won't need the nested group for custom configurations.
- It remains consistent with other processors. We can already decode/parse JSON, CSV, and now ION. They have their own processors.
```yaml
processor:
  - decode_protobuf:
      key: my_protobuf_key
      message_definition_file: "/path/to/proto_definition.proto"
  - decode_base64:
      key: my_base64_key
```
Compression might be a special case; maybe we'd have a single processor for that, though it wouldn't help with overall dependency reduction.
```yaml
processor:
  - decompress:
      key: my_gzip_key
      type: gzip
```
I think we need to support "/path/to/proto_definition.proto" being an S3 path as well, right?
@kkondaka,

> I think we need to support "/path/to/proto_definition.proto" being an S3 path as well, right?

That is probably ideal, though it could also come as a follow-on based on feedback. Also, I think we need a more general structure for getting data from S3, a file path, etc. The current approach is rather cumbersome for both users and developers. We could do something similar to what we did with AWS Secrets and hopefully will do with environment variables.
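As one possible shape for that more general structure, here is a hypothetical helper that resolves `message_definition_file` from either an `s3://` URI or a local path, using the AWS SDK for Java v2. The class name and overall approach are illustrative, not an existing Data Prepper API:

```java
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

// Hypothetical helper: load a definition file from S3 or the local filesystem.
public class DefinitionFileResolver {
    public static byte[] resolve(final String location) throws Exception {
        if (location.startsWith("s3://")) {
            // s3://bucket/key/path -> host is the bucket, path is the object key
            final URI uri = URI.create(location);
            try (S3Client s3 = S3Client.create()) {
                return s3.getObjectAsBytes(GetObjectRequest.builder()
                        .bucket(uri.getHost())
                        .key(uri.getPath().substring(1))
                        .build()).asByteArray();
            }
        }
        return Files.readAllBytes(Paths.get(location));
    }
}
```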
For protobuf decoding, what's the expected format of the file `/path/to/proto_definition.proto`? Is it supposed to contain the definition of the messages, something like:
```proto
syntax = "proto2";

message ProtoBufMessage {
  // Define your message fields here
  // Example:
  required int32 intField = 1;
  required string strField = 2;
}
```
I think it would be difficult to support such cases because such files need to be compiled.
It looks like if the above file is compiled and a descriptor file is created using the following command:

```
protoc --descriptor_set_out=MyMessage.desc MyMessage.proto
```

then using the file `MyMessage.desc` as the path in the Data Prepper `message_definition_file` configuration could work.
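To illustrate, here is a minimal sketch of decoding bytes with such a descriptor file using protobuf-java's `DynamicMessage`. The method and class names are illustrative, and it assumes the `.proto` has no imports:

```java
import java.io.FileInputStream;

import com.google.protobuf.DescriptorProtos.FileDescriptorSet;
import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.Descriptors.FileDescriptor;
import com.google.protobuf.DynamicMessage;
import com.google.protobuf.util.JsonFormat;

public class DescriptorDecodeSketch {
    // Decode protobuf-encoded bytes into JSON using a compiled descriptor set.
    public static String decode(final String descPath, final String messageName,
                                final byte[] payload) throws Exception {
        final FileDescriptorSet set;
        try (FileInputStream in = new FileInputStream(descPath)) {
            set = FileDescriptorSet.parseFrom(in);
        }
        // An empty dependency array assumes the .proto has no imports.
        final FileDescriptor fd = FileDescriptor.buildFrom(set.getFile(0), new FileDescriptor[] {});
        final Descriptor messageType = fd.findMessageTypeByName(messageName);
        final DynamicMessage message = DynamicMessage.parseFrom(messageType, payload);
        return JsonFormat.printer().print(message);
    }
}
```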
@kkondaka, what exactly is the descriptor in this proposal? Is it the "File descriptor" JSON in the following documentation?
https://protobuf.com/docs/descriptors#message-descriptors
We should also consider how to handle Protobuf imports.
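For imports, one possible approach is to generate the descriptor set with `protoc --include_imports` and then build each `FileDescriptor` by resolving its dependencies by name. A sketch under that assumption:

```java
import java.util.HashMap;
import java.util.Map;

import com.google.protobuf.DescriptorProtos.FileDescriptorProto;
import com.google.protobuf.DescriptorProtos.FileDescriptorSet;
import com.google.protobuf.Descriptors.FileDescriptor;

public class DescriptorImportResolver {
    // Build every FileDescriptor in a descriptor set, resolving imports by file name.
    // Assumes the set was produced with protoc --include_imports, so all
    // dependencies are present in the set itself.
    public static Map<String, FileDescriptor> buildAll(final FileDescriptorSet set) throws Exception {
        final Map<String, FileDescriptorProto> protosByName = new HashMap<>();
        for (final FileDescriptorProto proto : set.getFileList()) {
            protosByName.put(proto.getName(), proto);
        }
        final Map<String, FileDescriptor> built = new HashMap<>();
        for (final FileDescriptorProto proto : set.getFileList()) {
            build(proto, protosByName, built);
        }
        return built;
    }

    private static FileDescriptor build(final FileDescriptorProto proto,
                                        final Map<String, FileDescriptorProto> protosByName,
                                        final Map<String, FileDescriptor> built) throws Exception {
        if (built.containsKey(proto.getName())) {
            return built.get(proto.getName());
        }
        // Recursively build each dependency before building this file.
        final FileDescriptor[] deps = new FileDescriptor[proto.getDependencyCount()];
        for (int i = 0; i < deps.length; i++) {
            deps[i] = build(protosByName.get(proto.getDependency(i)), protosByName, built);
        }
        final FileDescriptor fd = FileDescriptor.buildFrom(proto, deps);
        built.put(proto.getName(), fd);
        return fd;
    }
}
```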