Tagging Events in Data Prepper

Open graytaylor0 opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe.

Currently, Data Prepper does not support a standard concept for adding descriptions dynamically for what happens in individual sources, processors, and sinks. For example, if the grok processor fails to match, the user should be able to look at an Event and tell that it was not able to match. Otherwise, it can feel like grok is simply not working at all. Another example would be a json processor failing to parse JSON. The user needs to know if the parsing failed in order to pinpoint the problems with their configuration. Additionally, many users would like to check for certain tags when routing, or drop Events with those tags to save space. Lastly, users of OpenSearch would like to query based on tags in Events. Data Prepper needs a concept for easily handling these types of situations for any sink, processor, or source.

Describe the solution you'd like

A key that is dedicated to adding information of this sort. For example, if grok fails to match, then the event will add a tags key with [grok_match_failure] as a value. The value of tags will be a Set, as it is not helpful to have duplicate tags.

{ "message": "a log", "tags": ["grok_match_failure"] }

Now a grok user is able to quickly tell that there was no match for this Event. If this Event then went through a json parser and failed to parse, you can add json_parse_failure to the tags key, like this.

{ "message": "a log", "tags": ["grok_match_failure", "json_parse_failure"] }

To make this functionality consistent between plugins, the Event class could have a new function

void addTag(String tagName);

which would add tagName to the set of tags.
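A minimal sketch of how that method could behave, assuming a Set-backed tag collection. TaggedEventSketch and getTags here are hypothetical stand-ins for illustration, not Data Prepper's actual Event class.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for Data Prepper's Event; not the real class.
class TaggedEventSketch {
    private final Map<String, Object> data = new LinkedHashMap<>();
    // A Set ignores duplicate tags; LinkedHashSet also keeps insertion order.
    private final Set<String> tags = new LinkedHashSet<>();

    TaggedEventSketch(String message) {
        data.put("message", message);
    }

    // The method proposed in this issue.
    void addTag(String tagName) {
        tags.add(tagName);
    }

    // Read-only view so callers cannot bypass addTag.
    Set<String> getTags() {
        return Collections.unmodifiableSet(tags);
    }

    public static void main(String[] args) {
        TaggedEventSketch event = new TaggedEventSketch("a log");
        event.addTag("grok_match_failure");
        event.addTag("json_parse_failure");
        event.addTag("grok_match_failure"); // duplicate, has no effect
        System.out.println(event.getTags()); // two tags, in insertion order
    }
}
```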

In order to separate tags from the rest of the Event, checking for tags with conditional expressions would look like this:

drop:
     when: 'event.hasTag("grok_match_failure")'
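To make the intended semantics concrete, here is a rough sketch of what a tag-based drop amounts to: events carrying a tag such as grok_match_failure are filtered out, and everything else passes through. DropSketchEvent and DropByTagSketch are hypothetical names, and the real conditional expression evaluation in Data Prepper may work quite differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical event type, for illustration only.
class DropSketchEvent {
    final String message;
    private final Set<String> tags = new HashSet<>();

    DropSketchEvent(String message, String... initialTags) {
        this.message = message;
        for (String tag : initialTags) {
            tags.add(tag);
        }
    }

    boolean hasTag(String tagName) {
        return tags.contains(tagName);
    }
}

class DropByTagSketch {
    // Keeps every event that does NOT carry the given tag.
    static List<DropSketchEvent> dropTagged(List<DropSketchEvent> events, String tagName) {
        List<DropSketchEvent> kept = new ArrayList<>();
        for (DropSketchEvent event : events) {
            if (!event.hasTag(tagName)) {
                kept.add(event);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<DropSketchEvent> events = List.of(
                new DropSketchEvent("parsed fine"),
                new DropSketchEvent("a log", "grok_match_failure"));
        // Only the untagged event survives the drop.
        for (DropSketchEvent event : dropTagged(events, "grok_match_failure")) {
            System.out.println(event.message);
        }
    }
}
```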

While I don't believe it to be required for the first iteration of tagging, a processor to control the adding and removing of custom tags could exist. It would look something like this:

processor:
     - tag_manager:
          add_tags: ["tag3", "tag4"]
          remove_tags: ["tag1", "tag2"]

This processor could also be split into two, with one called add_tags and one called remove_tags.
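As a rough illustration of the combined processor, the sketch below applies add_tags and remove_tags to an event's tag set. TagManagerSketch is a hypothetical class, and applying removals before additions is an assumption; the proposal does not specify an order.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical processor mirroring the add_tags / remove_tags options.
class TagManagerSketch {
    private final List<String> addTags;
    private final List<String> removeTags;

    TagManagerSketch(List<String> addTags, List<String> removeTags) {
        this.addTags = addTags;
        this.removeTags = removeTags;
    }

    // Mutates the tag set in place. Assumption: removals run before additions.
    void apply(Set<String> tags) {
        tags.removeAll(removeTags);
        tags.addAll(addTags);
    }

    public static void main(String[] args) {
        Set<String> tags = new LinkedHashSet<>(List.of("tag1", "tag2", "grok_match_failure"));
        new TagManagerSketch(List.of("tag3", "tag4"), List.of("tag1", "tag2")).apply(tags);
        System.out.println(tags); // tag1 and tag2 are gone, tag3 and tag4 are present
    }
}
```

Splitting this into separate add_tags and remove_tags processors would amount to keeping only one of the two lines in apply.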

Describe alternatives you've considered (Optional)

This problem could be solved at the plugin level. So given the same scenario with grok match failure, the Event would become something like

{ "message": "a log", "grok_match_failure": true }

and then after the json parsing fails, the Event would become something like

{ "message": "a log", "grok_match_failure": true, "json_parse_failure": true }

As you can tell, this solution doesn't scale as well. With a large number of sources, processors, and sinks adding their own booleans to an Event, the Event could quickly become cluttered, and querying for tags in OpenSearch would also become more of a pain.

Add the tags to the EventMetadata

Additionally, the tags could be added to the EventMetadata rather than the actual Event. This would make the Events cleaner, and the overall tagging options more configurable and separated from the event data itself. The EventMetadata would contain the following:

Set<String> getTags();

This approach would allow for conditional checks on tags, but it needs a little more implementation to make the tags a part of the sink output. For example, the OpenSearch sink could have a configuration option like the following:

opensearch:
  host: ["localhost:9200"]
  save_tags: false  # default would be true

This would give individual sinks the ability to configure tags however they please (for example, they could change the name of the tags key or remove certain tags at the sink level).

The one concern with this is that it would result in some unnecessary duplicate code, but it is entirely likely that some sinks would like to have the tags in the Event itself, and some would not. To make some options like the save_tags logic reusable, a plugin could be created that would handle the logic for adding the tags from the EventMetadata to the Event itself before it is shipped to the sink.
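One way such a reusable helper might look is sketched below, assuming the tags live in the metadata and are copied into the outgoing document only when save_tags is enabled. EventMetadataSketch, SinkTagHelperSketch, and prepareForSink are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for EventMetadata; only the tag-related part is shown.
class EventMetadataSketch {
    private final Set<String> tags = new LinkedHashSet<>();

    void addTag(String tagName) {
        tags.add(tagName);
    }

    Set<String> getTags() {
        return Collections.unmodifiableSet(tags);
    }
}

// Hypothetical shared helper a sink could call just before writing.
class SinkTagHelperSketch {
    // Returns the document to write. The original event data is not modified.
    static Map<String, Object> prepareForSink(Map<String, Object> eventData,
                                              EventMetadataSketch metadata,
                                              boolean saveTags) {
        Map<String, Object> document = new LinkedHashMap<>(eventData);
        if (saveTags) {
            document.put("tags", new ArrayList<>(metadata.getTags()));
        }
        return document;
    }

    public static void main(String[] args) {
        Map<String, Object> eventData = new LinkedHashMap<>();
        eventData.put("message", "a log");
        EventMetadataSketch metadata = new EventMetadataSketch();
        metadata.addTag("grok_match_failure");

        System.out.println(prepareForSink(eventData, metadata, true));  // includes a tags entry
        System.out.println(prepareForSink(eventData, metadata, false)); // message only
    }
}
```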

Additional context

Please provide alternatives to solve this problem if there are other ideas that make more sense than the tagging concept described here.

graytaylor0 avatar Nov 19 '21 02:11 graytaylor0

Thanks for the proposal. Yes, this would be helpful.

Regarding the idea of using an array, I believe this is also best for OpenSearch documents. It should be easier to query documents by matching on a tag than by matching on a boolean field.

I think it would be appropriate to make this a concept directly on Event. This should promote consistency across plugins.

Perhaps the following method can be added to Event:

void addTag(String tagName);

dlvenable avatar Dec 16 '21 21:12 dlvenable

I have a few questions about putting the tags field directly in the event:

  • Should the tags field be a protected field in the event? Can rename_key, copy_key, delete_key, etc. alter the tags?
  • What happens if an event contains a tags field that is not a set?

I think there is potential for some poor experiences, depending on the default behavior of processors and events, if we put the tags directly in the event.

cmanning09 avatar Jul 14 '22 21:07 cmanning09

@cmanning09 After thinking about your comments for a while, I think I am more in favor of tags being a part of the EventMetadata, and then creating a plugin that can be used by all sinks to handle tagging. This way, the choice of tag key name and of whether tags are sent to the sink as part of the Event is configurable, and it can be made very clear that mutate processors cannot be configured by the user to directly alter the tags. This approach also makes the conditional expressions based on tags less confusing, as it would be more apparent that /tags or hasKey("tags") would actually check the Event for a literal "tags" key, while getTags() == { "grok_match_failure", "json_parse_failure" } or hasTag("grok_match_failure") == false would only check the tags in the EventMetadata. The full plugin would look something like this by default:

- opensearch:
     tagging:
        include_tags_in_event: false
        tags_key_name: "tags"

This configuration would not include the tags as part of the Event sent to the OpenSearch sink.

The same plugin could be used by, for example, an s3 sink as well. In this case it sends the tags to the sink as part of the Event under the newTagsKey key:

- s3:
     tagging:
        include_tags_in_event: true
        tags_key_name: "newTagsKey"

This way, tags are completely decoupled from the Event until right before they are sent to the sink, and could be serialized into the Event with a separate method on Event such as event.toJsonStringWithTags("newTagsKey") for the sink plugins to use if include_tags_in_event is true and tags_key_name is newTagsKey. Even if the user had a tags Integer in the Event already and still used the default of tags for tags_key_name (which they should know not to do), nothing would collide and cause errors or data overwriting, and there would just be two different JSON keys for tags. This also provides some interesting query functionality in OpenSearch for the case of multiple sinks that use the same OpenSearch cluster with the same index, as one would still be able to quickly figure out which path an Event took by querying on the assigned tag key.
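A sketch of how such a serialization method could work, assuming tags are stored apart from the event data and appended only at serialization time. JsonTagsEventSketch is hypothetical, the hand-rolled JSON writer only escapes quotes and backslashes (a real implementation would use a JSON library), and the sketch does not address the case where the chosen key already exists in the event data.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical event whose tags live outside the event data.
class JsonTagsEventSketch {
    private final Map<String, Object> data = new LinkedHashMap<>();
    private final Set<String> tags = new LinkedHashSet<>();

    void put(String key, Object value) {
        data.put(key, value);
    }

    void addTag(String tagName) {
        tags.add(tagName);
    }

    // Serializes the event data only; tags are left out.
    String toJsonString() {
        return dataFields().toString();
    }

    // Serializes the event data plus the tags under the given key.
    String toJsonStringWithTags(String tagsKeyName) {
        StringJoiner tagArray = new StringJoiner(",", "[", "]");
        for (String tag : tags) {
            tagArray.add(quote(tag));
        }
        StringJoiner fields = dataFields();
        fields.add(quote(tagsKeyName) + ":" + tagArray);
        return fields.toString();
    }

    private StringJoiner dataFields() {
        StringJoiner fields = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> entry : data.entrySet()) {
            fields.add(quote(entry.getKey()) + ":" + value(entry.getValue()));
        }
        return fields;
    }

    // Numbers, booleans, and null are written bare; everything else as a string.
    private static String value(Object v) {
        if (v == null || v instanceof Number || v instanceof Boolean) {
            return String.valueOf(v);
        }
        return quote(String.valueOf(v));
    }

    // Simplified escaping: backslashes and double quotes only.
    private static String quote(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        JsonTagsEventSketch event = new JsonTagsEventSketch();
        event.put("message", "a log");
        event.addTag("grok_match_failure");
        System.out.println(event.toJsonString());
        System.out.println(event.toJsonStringWithTags("newTagsKey"));
    }
}
```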

graytaylor0 avatar Jul 21 '22 01:07 graytaylor0

@kkondaka , I re-opened this issue to be sure we don't lose anything. Is there anything else necessary here?

dlvenable avatar May 15 '23 16:05 dlvenable

@dlvenable it depends on what this issue includes. We have to add support to add tags at many places. Not sure if this issue is supposed to be for all those cases.

kkondaka avatar May 15 '23 16:05 kkondaka

I think the processor should also optionally support conditions. So, the processor config should look something like this:

processor:
   - tag_manager:
          add_tags: ["tag3", "tag4"]
          add_when: <condition1>
          remove_tags: ["tag1", "tag2"]
          remove_when: <condition2>
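A rough sketch of the conditional variant, with Java predicates standing in for the add_when and remove_when expressions. ConditionalTagManagerSketch is a hypothetical class; evaluating the conditions against the event data and running removals before additions are both assumptions.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical processor: each tag list is applied only when its condition holds.
class ConditionalTagManagerSketch {
    private final List<String> addTags;
    private final Predicate<Map<String, Object>> addWhen;
    private final List<String> removeTags;
    private final Predicate<Map<String, Object>> removeWhen;

    ConditionalTagManagerSketch(List<String> addTags,
                                Predicate<Map<String, Object>> addWhen,
                                List<String> removeTags,
                                Predicate<Map<String, Object>> removeWhen) {
        this.addTags = addTags;
        this.addWhen = addWhen;
        this.removeTags = removeTags;
        this.removeWhen = removeWhen;
    }

    // Conditions are evaluated against the event data, not against the tags.
    void apply(Map<String, Object> eventData, Set<String> tags) {
        if (removeWhen.test(eventData)) {
            tags.removeAll(removeTags);
        }
        if (addWhen.test(eventData)) {
            tags.addAll(addTags);
        }
    }

    public static void main(String[] args) {
        // Stand-in condition: the event has an integer status of 500 or more.
        Predicate<Map<String, Object>> isServerError =
                data -> data.get("status") instanceof Integer && (Integer) data.get("status") >= 500;

        ConditionalTagManagerSketch manager = new ConditionalTagManagerSketch(
                List.of("tag3", "tag4"), isServerError,
                List.of("tag1", "tag2"), isServerError.negate());

        Set<String> tags = new LinkedHashSet<>(List.of("tag1"));
        manager.apply(Map.of("status", 500), tags);
        System.out.println(tags); // add condition held, remove condition did not
    }
}
```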

kkondaka avatar May 19 '23 00:05 kkondaka

@kkondaka , I think the only item left here is to include tags in the documents for OpenSearch.

If we don't get this in for 2.3, can you create a new GitHub issue for this addition?

dlvenable avatar May 26 '23 17:05 dlvenable

Completed by #2745 #2690 #2680 #2629

kkondaka avatar Jun 05 '23 20:06 kkondaka

The feature for writing tags to OpenSearch documents was not included in this version and will be worked on as part of #2827.

dlvenable avatar Jun 05 '23 21:06 dlvenable