data-prepper
data-prepper copied to clipboard
Tagging Events in Data Prepper
Is your feature request related to a problem? Please describe.
Currently, Data Prepper does not support a standard concept for adding descriptions dynamically for what happens in individual sources, processors, and sinks. For example, if the grok processor fails to match, the user should be able to look at an Event and tell that it was not able to match. Otherwise, it can feel like grok is simply not working at all. Another example would be a json
processor failing to parse json. The user needs to know if the parsing failed in order to pinpoint the problems with their configuration. Additionally, many users would like to check for certain tags when routing, or drop Events with those tags to save space. Lastly, users of OpenSearch would like to query based on tags in Events. Data Prepper needs a concept for easily handling these types of situations for any sink, processor, or source.
Describe the solution you'd like
A key that is dedicated to adding information of this sort. For example, if grok fails to match, then the event will add a tags
key with [grok_match_failure]
as a value. The value of tags
will be a Set
, as it is not helpful to have duplicate tags.
{ "message": "a log", "tags": ["grok_match_failure"] }
Now a grok user is able to quickly tell that there was no match for this Event. If this Event then went through a json parser and failed to parse, you can add json_parse_failure
to the tags
key, like this.
{ "message": "a log", "tags": ["grok_match_failure", "json_parse_failure"] }
To make this functionality consistent between plugins, the Event
class could have a new function
void addTag(String tagName);
which would add tagName
to the set of tags
.
In order to separate tags from the rest of the Event, checking for tags with conditional expressions would look like this:
drop:
when: 'event.hasTag("grok_parse_failure")'
While I don't believe it to be required for the first iteration of tagging, a processor to control the adding and removing of custom tags could exist. It would look something like this:
processor:
- tag_manager:
add_tags: ["tag3", "tag4"]
remove_tags: ["tag1", "tag2"]
This processor could also be split into two, with one called add_tags
and one called remove_tags
.
Describe alternatives you've considered (Optional) This problem could be solved at the plugin level. So given the same scenario with grok match failure, the Event would become something like
{ "message": "a log", "grok_match_failure": true }
and then after the json parsing fails, the Event would become something like
{ "message": "a log", "grok_match_failure": true, "json_parse_failure": true }
As you can tell, this solution doesn't scale as well. You can imagine that with a large amount of sources, processors, and sinks adding their own booleans to an Event, the Event could quickly become cluttered, and the querying for tags in OpenSearch would also become more of a pain.
Add the tags to the EventMetadata
Additionally, the tags could be added to the EventMetadata
rather than the actual Event
. This would make the Events cleaner, and the overall tagging options more configurable and extracted from the event data itself. The EventMetadata
would contain the following:
Set<String> getTags();
This approach would allow for conditional checks on tags, but it needs a little more implementation to make the tags a part of the sink output. For example, the OpenSearch sink could have a configuration option like the following:
opensearch:
host: ["localhost:9200"]
save_tags: false (default would be true)
This would give individual sinks the ability to configure tags however they please (they could change the name of the tags
key or remove certain tags at the sink level)
The one concern with this is that it would result in some unnecessary duplicate code, but it is entirely likely that some sinks would like to have the tags in the Event itself, and some would not. To make some options like the save_tags
logic reusable, a plugin could be created that would handle the logic for adding the tags from the EventMetadata
to the Event
itself before it is shipped to the sink.
Additional context Please provide alternatives to solve this problem if there are other ideas that make more sense than the tagging concept described here.
Thanks for proposal. Yes, this would be helpful.
Regarding the idea of using an array, I believe this is also best for OpenSearch documents. It should be easier to query documents by querying matches on tag
than doing a boolean match.
I think it would be appropriate to make this a concept directly on Event
. This should promote consistency across plugins.
Perhaps the following method can be added to Event
:
void addTag(String tagName);
I have a few questions about putting the tags
field directly in the event:
- Should the
tags
field be a protected field in the event? Canrename_key
, 'copy_key', 'delete_key', etc. alter the tags? - What happens if an event contains a
tags
that is not a set?
I think there is some potential for some poor experiences depending on the processors/events default behavior if we put the tags directly in the event.
@cmanning09 After thinking about your comments for a while, I think I am more in favor of tags being a part of the EventMetadata
, and then creating a plugin that can be used by all sinks to handle tagging. This way, all the functionality of choosing the tag name and whether they are sent to the sink as part of the Event is configurable, and it can be made very clear that mutate processors cannot be configured by the user to directly alter the tags. This approach also makes the conditional expressions based on tags less confusing, as it would be more apparent that /tags
or hasKey("tags")
would actually check the Event for a literal "tags" key, while getTags() == { "grok_match_failure", "json_parse_failure" }
or hasTag("grok_match_failure") == false
would only check the tags in the EventMetadata. The full plugin would look something like this by default
- opensearch:
tagging:
include_tags_in_event: false
tags_key_name: "tags"
And this configuration would not include the tags as part of the Event sent to the OpenSearch sink.
and this same plugin could be utilized by, for example, an s3 sink as well (in this case it is sending the tags to the sink as part of the Event under the newTagsKey
key.
- s3:
tagging:
include_tags_in_event: true
tags_key_name: "newTagsKey"
This way, tags
are completely decoupled from the Event until right before they are sent to the sink, and could be serialized into the Event with a separate method on Event such as event.toJsonStringWithTags("newTagsKey")
for the sink plugins to use if include_tags_in_event
is true and tag_key_name
is newTagsKey
. Even if the user had a tags
Integer in the Event already and still used the default of tags
for tag_key_name (which they should know not to do), nothing would collide and cause errors or data overwriting, and there would just be 2 different json keys for tags
. This also provides some interesting query functionality in OpenSearch for the case of multiple sinks that use the same OpenSearch cluster with the same index, as one would still be able to figure out which path an Event took quickly by querying on the assigned tag key.
@kkondaka , I re-opened this issue to be sure we don't lose anything. Is there anything else necessary here?
@dlvenable it depends on what this issue includes. We have to add support to add tags at many places. Not sure if this issue is supposed to be for all those cases.
I think the processor also should have a support for condition optionally. So, the processor config should look something likes
processor:
- tag_manager:
add_tags: ["tag3", "tag4"]
add_when: <condition1>
remove_tags: ["tag1", "tag2"]
remove_when: <condition2>
@kkondaka , I think the only item left here is to include tags in the documents for OpenSearch.
If we don't get this in for 2.3, can you create a new GitHub issue for this addition?
Completed by #2745 #2690 #2680 #2629
The feature for writing tags to OpenSearch documents was not included in this version and will be worked as part of #2827.