data-prepper
Enhance Key Value Processor
Is your feature request related to a problem? Please describe.
The KeyValue processor tries to find key-value pairs in the entire source field. However, the key-value pairs may be present only in a small substring of that field. We need a way to narrow the scope in which key-value pairs are searched for within a given field.
Describe the solution you'd like
- Enhance Key Value processor to pick start-idx (default 0) and end-idx (default end of string) in a given input field string for finding key value pairs. For example -
key_value:
source: message
start_idx: 5
end_idx: 15
- Enhance Key Value processor to pick starting string pattern and ending string-pattern. Finds key value pattern only between the two patterns. For example -
key_value:
source: message
start_pattern: "("
end_pattern: ")"
- Enhance Key Value processor to find only given keys, that is, only the specified keys in the source. This can be combined with either of the start/end index or start/end pattern options above. For example -
key_value:
source: message
include_keys: [ "key1", "key2", "key3" ]
- Start processing key values based on a separator like "=", picking the key from the left side and the value from the right side, with an option to group values like [value1, value2] (so a comma cannot be used blindly as the field separator). Also, some values such as url should ignore HTTP arguments after "?". For example, url=<something>?x=y should be just url=<something>
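The first three options above can be sketched in a few lines of Python. This is only an illustration of the requested behavior, not Data Prepper code: the function names are hypothetical, `end_idx` is assumed to be exclusive (the request does not say), and "," and "=" are assumed as the field and key-value separators.

```python
def narrow_scope(text, start_idx=0, end_idx=None,
                 start_pattern=None, end_pattern=None):
    """Return the part of text in which key-value pairs are searched."""
    scoped = text[start_idx:end_idx]  # assumption: end_idx is exclusive
    if start_pattern is not None:
        pos = scoped.find(start_pattern)
        if pos != -1:
            scoped = scoped[pos + len(start_pattern):]
    if end_pattern is not None:
        pos = scoped.find(end_pattern)
        if pos != -1:
            scoped = scoped[:pos]
    return scoped


def extract(text, include_keys=None, field_sep=",", kv_sep="=", **scope):
    """Parse pairs inside the narrowed scope, keeping only include_keys if given."""
    pairs = {}
    for part in narrow_scope(text, **scope).split(field_sep):
        key, sep, value = part.partition(kv_sep)
        key = key.strip()
        if sep and (include_keys is None or key in include_keys):
            pairs[key] = value.strip()
    return pairs


print(extract("prefix (key1=a,key2=b,other=c) suffix",
              start_pattern="(", end_pattern=")",
              include_keys=["key1", "key2"]))
```

Index-based and pattern-based narrowing are combined in one helper here only to keep the sketch short; the request treats them as separate options.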
Some examples
- Text like
<some text> k1=v1,k2=v2 <other text>
should only extract key values k1 and k2, like {"k1": "v1", "k2": "v2"}
- If the value has a group of values, they should be treated as single value. For example,
k1=[v1, v2],k3=v3
should generate {"k1": ["v1", "v2"], "k3": "v3"}
- Values should be able to ignore some parts of themselves. For example,
url=http://abcd.com?k1=v1&k2=v2
should have an option to generate just {"url": "http://abcd.com"}
and not include the text after "?"
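The grouping and URL examples above could behave roughly like the following Python sketch. It is an assumption-laden illustration: the function names and the `strip_query` flag are hypothetical, groups are assumed to use square brackets, and items inside a group are assumed to be comma-separated.

```python
def split_top_level(text, field_sep=","):
    """Split on field_sep, but not inside [...] groups."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        if ch == field_sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_pairs(text, field_sep=",", kv_sep="=", strip_query=False):
    """Parse key=value pairs; bracketed values become lists."""
    result = {}
    for part in split_top_level(text, field_sep):
        key, sep, value = part.partition(kv_sep)
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Assumption: group items are always comma-separated.
            value = [item.strip() for item in value[1:-1].split(",")]
        elif strip_query:
            # Drop everything from the first "?" onwards.
            value = value.split("?", 1)[0]
        result[key] = value
    return result


print(parse_pairs("k1=[v1, v2],k3=v3"))
print(parse_pairs("url=http://abcd.com?k1=v1&k2=v2", strip_query=True))
```

Making the query stripping opt-in keeps the full URL by default, which matters for the objection raised later in this thread.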
Enhance Key Value processor to pick start-idx (default 0) and end-idx (default end of string) in a given input field string for finding key value pairs. For example -
This is a specific solution for a specific problem. We may hit this in other processors such as grok or parse_json. What if we offer expressions in the source and provide a substring function?
key_value:
source: '${substring(message, 5, 15)}'
This approach seems more flexible and could be re-used to solve other problems as they arise.
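To make the idea concrete, here is a small Python sketch of how a processor might resolve such a source expression before parsing. The `${substring(field, start, end)}` syntax comes from the example above; the function name `resolve_source`, the exclusive end index, and the fallback to a plain field name are assumptions made for illustration.

```python
import re

# Matches only the exact form ${substring(field, start, end)}.
_SUBSTRING = re.compile(
    r"^\$\{substring\(\s*(\w+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)\}$"
)


def resolve_source(source, event):
    """Return the text a processor should parse for the given source setting."""
    match = _SUBSTRING.match(source)
    if match is None:
        # Not an expression: treat source as a plain field name.
        return event[source]
    field = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    return event[field][start:end]  # assumption: end is exclusive


event = {"message": "junk k1=v1,k2=v2 trailing"}
print(resolve_source("${substring(message, 5, 16)}", event))
```

Because the resolution happens before any parsing, the same helper could serve grok or parse_json without changes to those processors.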
Enhance Key Value processor to find only given keys. Finds only the specified keys in the source. This can be combined with any of the above two start/end index/pattern options For example -
We use the terms include_keys or just keys in other processors. Can we use one of those instead of find_keys for consistency?
@dlvenable updated as per your suggestion
@kkondaka I think this is not correct:
values should be able ignore some parts of them. For example url=http://abcd.com?k1=v1&k2=v2 should have an option to generate just {"url":"http://abcd.com"} and not include the text after "?"
The correct version should be:
values should be able ignore some parts of them. For example url=http://abcd.com?k1=v1&k2=v2 should have an option to generate just {"url":"http://abcd.com?k1=v1&k2=v2"}
We do not want to lose the full URL. Why would you want to parse out only the domain?
The key-value separator should be definable, and "," is most likely the most sensible default. However, you should explain in your description what would happen in scenarios like this:
<some text> k1=v1,url=http://foo.com?bar=text,text&foo=zoo <other text>
What we certainly don't want to end up with:
{"k1":"v1", "url":"http://foo.com?bar=text", "text&foo":"zoo"}
I don't think I have good proposals for this except asking users that url values must always be encoded and never include the key/value separator in plain text. Do you have better ideas?
@zsaltys Thank you very much for your comments. I agree with your last paragraph. It is very ambiguous if "," and "=" exist inside a value and should be treated differently there. That's the reason URLs usually carry them as encoded values, with "," as %2C and "=" as %3D. But do Yahoo logs encode them or not? If not, what do you suggest we should do?
@kkondaka Unfortunately I can't answer for all of Yahoo on whether they encode all their logs or not, but I know there are teams who do not.
We looked a bit at Splunk and it seems it's able to handle this correctly (most of the time). We believe what it does is that it does not just purely split on comma but it actually checks if what follows after a comma matches a pattern of key=value. In other words when Splunk splits it likely is looking for a pattern like ,key=value not just ,
So in our example:
foo k1=v1,url=http://foo.com?bar=text,text&foo=zoo bar,k2=v2
Splunk will parse out:
k1=v1
url=http://foo.com?bar=text,text&foo=zoo bar (they don't always get this correctly)
bar=text,text (splunk will look inside urls)
foo=zoo (splunk will look inside urls)
k2=v2
We would be OK if the result simply was:
k1=v1
url=http://foo.com?bar=text,text&foo=zoo bar
k2=v2
It seems the splitter logic should be a little more involved, and the split pattern should be something like /,[^&]+=/
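A minimal Python sketch of that splitting idea, applied to the example above. It uses a lookahead variant of the suggested pattern so that the next key is not consumed by the split. The function name is hypothetical, and taking the last whitespace-separated token before "=" as the key is an assumption.

```python
import re

# Split on a comma only when it is followed by a run of non-"&"
# characters and then "=". The lookahead leaves the next key intact.
FIELD_SPLIT = re.compile(r",(?=[^&]+=)")


def split_pairs(text):
    """Split text into key=value pairs using the ',key=' heuristic."""
    pairs = {}
    for segment in FIELD_SPLIT.split(text):
        left, sep, value = segment.partition("=")
        if not sep or not left.strip():
            continue
        # Assumption: the key is the last token before "=",
        # so leading free text such as "foo " is dropped.
        key = left.split()[-1]
        pairs[key] = value
    return pairs


line = "foo k1=v1,url=http://foo.com?bar=text,text&foo=zoo bar,k2=v2"
for key, value in split_pairs(line).items():
    print(f"{key}={value}")
```

This is a heuristic: a query string that contains a comma followed by its own name=value before any "&" would still be split in the wrong place.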
@zsaltys Thanks for the info, but this looks very confusing, and "Splunk seems to handle it most of the time correctly" does not give much confidence. It looks like there is a lot of ambiguity here for implementing this in a generic way. I think a generic way of handling this is to treat URL values (those starting with "http://") differently.
@zsaltys what happens in the following case
foo k1=v1,url=http://foo.com?bar=text,text&foo=zoo bar,k2=http://bar.com?a=b&c=d
In this case there is an "&" after "bar,", so how do we know how to parse this? Your regular expression "/,[^&]+=/" will not work in this case. Unless we have a clear demarcation of where the URL ends, it would be very difficult to implement this correctly.
@zsaltys ,
Thank you for adding more information on the requirements. One possible solution would be to look at a URL and then run a key-value set on this. But, this might look like the following:
k1=v1
url=http://foo.com?bar=text,text&foo=zoo
bar=text,text
foo=zoo
k2=v2
We could detect that the URL ended. But you wouldn't have the bar at the end. Do you want that part?
With the example in @kkondaka 's comment.
foo k1=v1,url=http://foo.com?bar=text,text&foo=zoo bar,k2=http://bar.com?a=b&c=d
We'd get:
k1=v1
url=http://foo.com?bar=text,text&foo=zoo
bar=text,text
foo=zoo
k2=http://bar.com?a=b&c=d
a=b
c=d
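The "run a key-value set on the URL" step could look roughly like this in Python, using the standard library's URL parser. The function name `expand_url_value` is hypothetical, and the sketch assumes the URL value has already been isolated, which is the hard part discussed above.

```python
from urllib.parse import parse_qsl, urlsplit


def expand_url_value(key, value):
    """Keep the full URL under key and add its query parameters as extra pairs."""
    pairs = {key: value}
    if value.startswith(("http://", "https://")):
        query = urlsplit(value).query
        # parse_qsl splits on "&" and also percent-decodes names and values.
        for name, param in parse_qsl(query, keep_blank_values=True):
            # Assumption: an existing key is never overwritten.
            pairs.setdefault(name, param)
    return pairs


print(expand_url_value("url", "http://foo.com?bar=text,text&foo=zoo"))
print(expand_url_value("k2", "http://bar.com?a=b&c=d"))
```

Non-URL values pass through unchanged, so the helper could be applied to every extracted pair.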
For a basic implementation, I propose focusing on a single delimiter with support for escaping (same as CSV). The default delimiter should be a space (configurable) and the default escape character should be " (configurable).
Given input like this:
foo k1=v1 url=http://foo.com?bar=text,text&foo=zoo bar k2="http://bar.com?a=b&c=foo bar" har
Output in query_params should be extracted as:
k1=v1
url=http://foo.com?bar=text,text&foo=zoo
k2=http://bar.com?a=b&c=foo bar
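A minimal Python sketch of this proposal, under the assumption that the escape character works like a CSV quote: delimiters between a pair of quote characters are kept, and the quote characters themselves are dropped. The function names and the choice to skip tokens without "=" are illustrative only.

```python
def split_with_quotes(text, delimiter=" ", quote='"'):
    """Split on delimiter, except inside a quoted section."""
    tokens, current, in_quotes = [], [], False
    for ch in text:
        if ch == quote:
            in_quotes = not in_quotes  # toggle state, drop the quote itself
        elif ch == delimiter and not in_quotes:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def parse_key_values(text, delimiter=" ", quote='"', kv_sep="="):
    """Keep only tokens of the form key=value; split on the first kv_sep."""
    result = {}
    for token in split_with_quotes(text, delimiter, quote):
        key, sep, value = token.partition(kv_sep)
        if sep and key:
            result[key] = value
    return result


line = ('foo k1=v1 url=http://foo.com?bar=text,text&foo=zoo bar '
        'k2="http://bar.com?a=b&c=foo bar" har')
for key, value in parse_key_values(line).items():
    print(f"{key}={value}")
```

Splitting each token only on the first "=" is what keeps the query parameters inside the url and k2 values instead of turning them into separate keys.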
@zsaltys thanks for the clarification.