
Enhance Key Value Processor

Open kkondaka opened this issue 10 months ago • 10 comments

Is your feature request related to a problem? Please describe. The Key Value processor tries to find key-value pairs in the entire source field. However, the key-value pairs may be present only in a small substring of the source field. We need a way to narrow the scope in which key-value pairs are searched for within a given field.

Describe the solution you'd like

  1. Enhance the Key Value processor to accept start_idx (default 0) and end_idx (default end of string), which bound the part of the input field string that is searched for key-value pairs. For example -
 key_value:
     source: message
     start_idx: 5
     end_idx: 15
  2. Enhance the Key Value processor to accept a starting string pattern and an ending string pattern. Key-value pairs are searched for only between the two patterns. For example -
 key_value:
     source: message
     start_pattern: "("
     end_pattern: ")"
  3. Enhance the Key Value processor to find only given keys. Only the specified keys are extracted from the source. This can be combined with either of the start/end index or pattern options above. For example -
 key_value:
     source: message
     include_keys: [ "key1", "key2", "key3" ]
  4. Process key-value pairs based on a separator such as "=", taking the key from the left side and the value from the right side, with an option to group values like [value1, value2] (so a comma cannot be used blindly as the field separator). Some values, such as URLs, should also be able to ignore the HTTP arguments after "?". For example, url=<something>?x=y should become just url=<something>
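
The first three options could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the processor's implementation: the function names `scope_source` and `parse_pairs` are hypothetical, `end_idx` is assumed to be exclusive, and the start/end patterns are assumed to be plain strings rather than regular expressions.

```python
def scope_source(text, start_idx=0, end_idx=None,
                 start_pattern=None, end_pattern=None):
    """Return the part of `text` that should be searched for key-value pairs."""
    # Index-based scoping (assumption: end_idx is exclusive, None means end of string).
    scoped = text[start_idx:end_idx]
    # Pattern-based scoping: keep only what lies between the two patterns.
    if start_pattern is not None:
        pos = scoped.find(start_pattern)
        if pos != -1:
            scoped = scoped[pos + len(start_pattern):]
    if end_pattern is not None:
        pos = scoped.find(end_pattern)
        if pos != -1:
            scoped = scoped[:pos]
    return scoped


def parse_pairs(text, field_sep=",", kv_sep="=", include_keys=None):
    """Split `text` into key-value pairs, optionally keeping only `include_keys`."""
    result = {}
    for field in text.split(field_sep):
        key, sep, value = field.partition(kv_sep)
        if not sep:
            continue  # no key-value separator in this field, skip it
        key = key.strip()
        if include_keys is not None and key not in include_keys:
            continue
        result[key] = value.strip()
    return result


scoped = scope_source("junk (k1=v1,k2=v2) tail", start_pattern="(", end_pattern=")")
print(parse_pairs(scoped, include_keys=["k1"]))
```

Scoping and parsing are kept as separate steps here so that `include_keys` can be combined with either the index or the pattern option.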

Some examples

  1. Text like <some text> k1=v1,k2=v2 <other text> should only extract the key-value pairs k1 and k2, like {"k1": "v1", "k2": "v2"}
  2. If the value is a group of values, the group should be treated as a single value. For example, k1=[v1, v2],k3=v3 should generate {"k1":["v1", "v2"], "k3": "v3"}
  3. Values should be able to ignore some parts of them. For example url=http://abcd.com?k1=v1&k2=v2 should have an option to generate just {"url":"http://abcd.com"} and not include the text after "?"
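
The grouped-value case in example 2 could be handled by splitting on the field separator only outside of brackets. The sketch below is one possible way to do that in Python; `split_top_level` and `parse_grouped` are hypothetical names, and it assumes groups are written with "[" and "]" and are not nested in any meaningful way.

```python
def split_top_level(text, field_sep=","):
    """Split on `field_sep`, ignoring separators that appear inside [ ... ]."""
    fields, current, depth = [], [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        if ch == field_sep and depth == 0:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_grouped(text):
    """Parse key=value fields, turning [a, b] values into lists."""
    result = {}
    for field in split_top_level(text):
        key, sep, value = field.partition("=")
        if not sep:
            continue
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [item.strip() for item in value[1:-1].split(",")]
        result[key.strip()] = value
    return result


print(parse_grouped("k1=[v1, v2],k3=v3"))
```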

kkondaka • Apr 22 '24 06:04

Enhance Key Value processor to pick start-idx (default 0) and end-idx (default end of string) in a given input field string for finding key value pairs. For example -

This is a specific solution for a specific problem. We may hit this in other processors such as grok or parse_json. What if we offer expressions in the source and provide a substring function?

key_value: 
     source: '${substring(message, 5, 15)}'

This approach seems more flexible and could be re-used to solve other problems as they arise.

dlvenable • Apr 23 '24 18:04

Enhance Key Value processor to find only given keys. Finds only the specified keys in the source. This can be combined with any of the above two start/end index/pattern options For example -

We use the terms include_keys or just keys in other processors. Can we use one of those instead of find_keys for consistency?

dlvenable • Apr 23 '24 18:04

@dlvenable updated as per your suggestion

kkondaka • Apr 23 '24 19:04

@kkondaka I think this is not correct:

Values should be able to ignore some parts of them. For example url=http://abcd.com?k1=v1&k2=v2 should have an option to generate just {"url":"http://abcd.com"} and not include the text after "?"

The correct version should be:

Values should be able to ignore some parts of them. For example url=http://abcd.com?k1=v1&k2=v2 should have an option to generate just {"url":"http://abcd.com?k1=v1&k2=v2"}

We do not want to lose the full URL. Why would you only want to parse out the domain?

The key-value separator should be definable, and "," is most likely the most sensible default. However, you should explain in your description what would happen in scenarios like this:

<some text> k1=v1,url=http://foo.com?bar=text,text&foo=zoo <other text>

What we certainly don't want to end up with:

{"k1":"v1", "url":"http://foo.com?bar=text", "text&foo":"zoo"}

I don't think I have good proposals for this except asking users that url values must always be encoded and never include the key/value separator in plain text. Do you have better ideas?
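
To make the problem concrete, the sketch below shows a purely naive split producing the unwanted result, and the same line parsing cleanly once the URL value is percent-encoded. It is only an illustration: `naive_parse` is a hypothetical helper, and encoding is done here with Python's standard `urllib.parse.quote`, assuming the log producer would encode the whole value.

```python
from urllib.parse import quote, unquote


def naive_parse(text, field_sep=",", kv_sep="="):
    """Split blindly on the field separator, then on the first kv separator."""
    result = {}
    for field in text.split(field_sep):
        key, sep, value = field.partition(kv_sep)
        if sep:
            result[key.strip()] = value.strip()
    return result


raw_url = "http://foo.com?bar=text,text&foo=zoo"

# Unencoded: the comma inside the URL is mistaken for a field boundary.
broken = naive_parse("k1=v1,url=" + raw_url)
print(broken)

# Encoded: "," becomes %2C and "=" becomes %3D, so the naive split is safe.
encoded_url = quote(raw_url, safe="")
parsed = naive_parse("k1=v1,url=" + encoded_url)
parsed["url"] = unquote(parsed["url"])
print(parsed)
```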

zsaltys • Apr 26 '24 11:04

@zsaltys Thank you very much for your comments. I agree with your last statement/paragraph. It is very ambiguous whether "," and "=" inside a value should be treated differently. That's the reason URLs usually have them as percent-encoded values, with "," as %2C and "=" as %3D. But do Yahoo logs encode them or not? If not, what do you suggest we should do?

kkondaka • Apr 26 '24 18:04

@kkondaka Unfortunately I can't answer for all of Yahoo on whether they encode all their logs or not, but I know there are teams who do not.

We looked a bit at Splunk and it seems it's able to handle this correctly (most of the time). We believe it does not purely split on the comma but actually checks whether what follows the comma matches a key=value pattern. In other words, when Splunk splits, it is likely looking for a pattern like ,key=value and not just ,

So in our example:

foo k1=v1,url=http://foo.com?bar=text,text&foo=zoo bar,k2=v2

Splunk will parse out:

k1=v1
url=http://foo.com?bar=text,text&foo=zoo bar (they don't always get this correctly)
bar=text,text (splunk will look inside urls)
foo=zoo (splunk will look inside urls)
k2=v2

We would be OK if the result simply was:

k1=v1
url=http://foo.com?bar=text,text&foo=zoo bar
k2=v2

It seems the splitter logic should be a little bit more involved and the split pattern should be something like /,[^&]+=/
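
A rough Python sketch of that idea follows. It wraps the suggested pattern in a lookahead so that only the comma itself is consumed by the split; the lookahead, the names `FIELD_BOUNDARY`, `split_fields`, and `parse`, and the assumption that leading free text has already been trimmed are all illustrative choices, not part of the proposal.

```python
import re

# Split at a comma only if it is followed by text without "&" and then "=".
# The lookahead (?=...) keeps the following key out of the consumed match.
FIELD_BOUNDARY = re.compile(r",(?=[^&]+=)")


def split_fields(text):
    """Split `text` into fields at commas that look like field boundaries."""
    return FIELD_BOUNDARY.split(text)


def parse(text):
    """Turn the fields into a dict, splitting each on its first "="."""
    result = {}
    for field in split_fields(text):
        key, sep, value = field.partition("=")
        if sep:
            result[key.strip()] = value
    return result


line = "k1=v1,url=http://foo.com?bar=text,text&foo=zoo bar,k2=v2"
print(parse(line))
```

Note that a comma inside a value that happens to be followed by something like `b=2` (for example `url=http://x.com?a=1,b=2`) would still be treated as a field boundary by this pattern.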

zsaltys • Apr 29 '24 10:04

@zsaltys Thanks for the info, but this looks very confusing, and "Splunk seems to handle it most of the time correctly" does not really give much confidence. It looks like there is a lot of ambiguity here to implement this in a generic way. I think the generic way of handling this is to handle URL values (those starting with "http://") differently.

kkondaka • Apr 30 '24 15:04

@zsaltys what happens in the following case

foo k1=v1,url=http://foo.com?bar=text,text&foo=zoo bar,k2=http://bar.com?a=b&c=d

In this case there is an "&" after "bar,". How do we know how to parse this? Your regular expression "/,[^&]+=/" will not work in this case. Unless we have a clear demarcation of where the URL ends, it would be very difficult to implement this correctly.

kkondaka • Apr 30 '24 20:04

@zsaltys ,

Thank you for adding more information on the requirements. One possible solution would be to look at a URL and then run a key-value set on this. But, this might look like the following:

k1=v1
url=http://foo.com?bar=text,text&foo=zoo
bar=text,text
foo=zoo
k2=v2

We could detect that the URL ended. But, you wouldn't have the bar at the end. Do you want that part?

dlvenable • Apr 30 '24 20:04

With the example in @kkondaka's comment:

foo k1=v1,url=http://foo.com?bar=text,text&foo=zoo bar,k2=http://bar.com?a=b&c=d

We'd get:

k1=v1
url=http://foo.com?bar=text,text&foo=zoo
bar=text,text
foo=zoo
k2=http://bar.com?a=b&c=d
a=b
c=d
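
The URL-aware step described above could look roughly like this in Python, using the standard `urllib.parse` module. The helper name `expand_url_value` is hypothetical, and the sketch assumes that a URL value starts with "http://" or "https://" and ends at the first whitespace.

```python
from urllib.parse import urlsplit, parse_qsl


def expand_url_value(key, value):
    """Return the pair itself, plus the query parameters if the value is a URL."""
    if not value.startswith(("http://", "https://")):
        return [(key, value)]
    # Assumption: the URL ends at the first whitespace; anything after is dropped.
    url = value.split(maxsplit=1)[0]
    pairs = [(key, url)]
    # Run a second key-value pass over the query string of the URL.
    pairs.extend(parse_qsl(urlsplit(url).query))
    return pairs


print(expand_url_value("url", "http://foo.com?bar=text,text&foo=zoo bar"))
print(expand_url_value("k2", "http://bar.com?a=b&c=d"))
```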

dlvenable • Apr 30 '24 20:04

For a basic implementation, I propose focusing on a single delimiter with support for escaping (same as CSV). The default delimiter should be a space (configurable) and the default escape character should be " (configurable).

Given input like this:

foo k1=v1 url=http://foo.com?bar=text,text&foo=zoo bar k2="http://bar.com?a=b&c=foo bar" har

Output in query_params should be extracted as:

k1=v1
url=http://foo.com?bar=text,text&foo=zoo
k2=http://bar.com?a=b&c=foo bar
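
A minimal Python sketch of this single-delimiter approach is shown below. The names `tokenize` and `parse_key_values` are hypothetical, and the sketch assumes the " character acts as a CSV-style quote: delimiters between a pair of quotes are kept as part of the value, and the quote characters themselves are dropped.

```python
def tokenize(text, delimiter=" ", quote='"'):
    """Split on `delimiter`, except inside a quoted section."""
    tokens, current, in_quotes = [], [], False
    for ch in text:
        if ch == quote:
            in_quotes = not in_quotes  # toggle quoting; the quote itself is dropped
        elif ch == delimiter and not in_quotes:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def parse_key_values(text, delimiter=" ", quote='"', kv_sep="="):
    """Keep only tokens of the form key=value, splitting on the first `kv_sep`."""
    result = {}
    for token in tokenize(text, delimiter, quote):
        key, sep, value = token.partition(kv_sep)
        if sep and key:
            result[key] = value
    return result


line = 'foo k1=v1 url=http://foo.com?bar=text,text&foo=zoo bar k2="http://bar.com?a=b&c=foo bar" har'
print(parse_key_values(line))
```

Tokens without a key-value separator, such as foo, bar, and har, are simply ignored.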

zsaltys • May 02 '24 19:05

@zsaltys thanks for the clarification.

kkondaka • May 02 '24 19:05