S3 Event Decoding Consistency
**Is your feature request related to a problem? Please describe.**
The `s3` source includes two codecs in 1.5, and a new codec for CSV processing is coming in 2.0. These populate Events somewhat differently.

- `newline-delimited` -> The line is saved to the `message` key of the Event. This is a single string.
- `json` -> The JSON is expanded into `message`. So, if the JSON has a key named `sourceIp`, it is populated in `/message/sourceIp`.
- `csv` -> Each key is expanded directly into the root of the Event (`/`). Thus, if the CSV has a key named `sourceIp`, it is populated in `/sourceIp`.
Also, the `s3` source adds two special keys to all Events: `bucket` and `key`. These indicate the S3 bucket and key, respectively, for the object. The `s3` source populates these, not the codecs. The resulting Event shapes are sketched below.
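To make the difference concrete, here is a minimal sketch using plain Java maps to stand in for Data Prepper Events. The log line, field names, bucket, and object key are made-up values, and this is illustrative code mirroring the behavior described above, not the actual codec or source implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain maps stand in for Data Prepper Events; illustrative only.
public class CodecShapesToday {
    public static void main(String[] args) {
        // newline-delimited: the whole line lands at /message as one string.
        Map<String, Object> newlineEvent = new LinkedHashMap<>();
        newlineEvent.put("message", "192.0.2.1 GET /index.html 200");

        // json: the parsed object lands under /message,
        // so sourceIp ends up at /message/sourceIp.
        Map<String, Object> jsonEvent = new LinkedHashMap<>();
        jsonEvent.put("message", Map.of("sourceIp", "192.0.2.1", "status", 200));

        // csv: columns land directly at the root, so sourceIp ends up at /sourceIp.
        Map<String, Object> csvEvent = new LinkedHashMap<>();
        csvEvent.put("sourceIp", "192.0.2.1");
        csvEvent.put("status", 200);

        // The s3 source itself then adds /bucket and /key to every Event,
        // regardless of which codec decoded the object.
        for (Map<String, Object> event : List.of(newlineEvent, jsonEvent, csvEvent)) {
            event.put("bucket", "my-logs-bucket");     // made-up bucket name
            event.put("key", "2022/07/01/access.log"); // made-up object key
            System.out.println(event);
        }
    }
}
```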
**Describe the solution you'd like**
First, all codecs should put the data in the same place consistently. Second, we should decide where we want this data to reside (`/message` or `/`). Third, it should avoid conflicting with `bucket` and `key`.
One possible solution is to change the `s3` source to save the `bucket` and `key` to a top-level object named `s3`. Then the codecs save to the root (`/`). This could lead to conflicts if the actual data has a column or field named `s3`. But, if we make this key configurable, then pipeline authors could potentially avoid this.
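A sketch of how that layout might look, again with plain maps standing in for Events. The `buildEvent` helper and its `metadataKey` parameter are hypothetical names for the configurable top-level key, not existing options.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the proposed layout; not existing Data Prepper code.
public class ProposedLayout {
    static Map<String, Object> buildEvent(Map<String, Object> decodedData,
                                          String metadataKey,
                                          String bucket,
                                          String objectKey) {
        // Decoded data occupies the root of the Event.
        Map<String, Object> event = new LinkedHashMap<>(decodedData);

        // bucket and key move under a single top-level object (default "s3").
        Map<String, Object> s3Info = new LinkedHashMap<>();
        s3Info.put("bucket", bucket);
        s3Info.put("key", objectKey);
        event.put(metadataKey, s3Info);
        return event;
    }

    public static void main(String[] args) {
        Map<String, Object> csvRow = Map.of("sourceIp", "192.0.2.1", "status", 200);
        // Produces /sourceIp, /status, /s3/bucket, and /s3/key.
        System.out.println(buildEvent(csvRow, "s3", "my-logs-bucket", "2022/07/01/access.log"));
    }
}
```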
**Describe alternatives you've considered (Optional)**
An alternative would be more robust support for Event metadata. The bucket and key could be saved as metadata. However, Data Prepper's conditional routing and processors don't support Event metadata presently.
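For illustration only, a conceptual sketch of that alternative in plain Java. This is not the actual Data Prepper `EventMetadata` API, and the limitation above still applies: routing and processors would have no way to reference the metadata side today.

```java
import java.util.Map;

// Conceptual sketch: bucket and key ride alongside the Event data instead of
// inside it. Plain Java, not the Data Prepper EventMetadata API.
public class MetadataAlternative {
    record S3Event(Map<String, Object> data, Map<String, String> s3Metadata) { }

    public static void main(String[] args) {
        S3Event event = new S3Event(
                Map.of("sourceIp", "192.0.2.1"),
                Map.of("bucket", "my-logs-bucket", "key", "2022/07/01/access.log"));

        // The data map stays clean, but conditional routing and processors
        // currently could not see the metadata map.
        System.out.println(event);
    }
}
```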
**Additional context**
- #251
- #1081
I like the suggestion that it should be configurable where we save the data to avoid conflicts.
I also like the suggestion that the destination (`message` or `/`) is configurable. Many processors use `message` as the default `source`, so changing the `csv` codec to write keys to `message` by default would preserve consistency with the other codecs and with other sources.
From my perspective, it's easier to deal with shallow (root-level) fields in OpenSearch and other processors because `message/` doesn't need to be prepended to each field when doing a transformation on that field. However, other users might have a different perspective and use case, so having this be configurable could help everyone.
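A sketch of what a configurable destination could look like, assuming a hypothetical `destination` option where `/` means the Event root and any other value nests the decoded fields under that key. Plain maps stand in for Events.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// "destination" is a hypothetical codec option, used here for illustration.
public class ConfigurableDestination {
    static Map<String, Object> place(Map<String, Object> decodedFields, String destination) {
        Map<String, Object> event = new LinkedHashMap<>();
        if ("/".equals(destination)) {
            // Root-level fields: downstream processors reference /sourceIp.
            event.putAll(decodedFields);
        } else {
            // Nested fields: downstream processors reference /message/sourceIp.
            event.put(destination, decodedFields);
        }
        return event;
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("sourceIp", "192.0.2.1");
        System.out.println(place(row, "/"));        // {sourceIp=192.0.2.1}
        System.out.println(place(row, "message"));  // {message={sourceIp=192.0.2.1}}
    }
}
```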
I like the suggestion that the `key` and `bucket` are within a top-level `s3` key. We could get around conflicts by changing any conflicting keys to an absolute path: if `s3` is a key in the data and the data is configured to write to the root, then write that field to `/message/s3` instead and keep `/s3/bucket` and `/s3/key` for the source-provided values.
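A sketch of that conflict-handling idea, with made-up values; the relocation target `/message/s3` and the helper below are illustrative, not an existing implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// If the decoded data itself contains an "s3" field, relocate it (here under
// /message/s3) so the source-provided /s3/bucket and /s3/key keep their place.
public class ConflictRename {
    static Map<String, Object> resolve(Map<String, Object> decoded, String bucket, String objectKey) {
        Map<String, Object> event = new LinkedHashMap<>(decoded);

        if (event.containsKey("s3")) {
            // Move the conflicting data field out of the way.
            Object conflicting = event.remove("s3");
            Map<String, Object> message = new LinkedHashMap<>();
            message.put("s3", conflicting);
            event.put("message", message);
        }

        // The reserved top-level s3 object always holds bucket and key.
        event.put("s3", Map.of("bucket", bucket, "key", objectKey));
        return event;
    }

    public static void main(String[] args) {
        Map<String, Object> decoded = Map.of("s3", "s3://other/object", "sourceIp", "192.0.2.1");
        // Keeps /sourceIp, moves the data's s3 field to /message/s3,
        // and writes /s3/bucket and /s3/key.
        System.out.println(resolve(decoded, "my-logs-bucket", "2022/07/01/access.log"));
    }
}
```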