victorialogs-datasource icon indicating copy to clipboard operation
victorialogs-datasource copied to clipboard

incorrect interpolation of multi-value variable in regex

Open pgassmann opened this issue 9 months ago • 15 comments

the query: host:~'^$host$' with host variable multi select of backup3 and backup4

is interpolated to: host:~'^"(backup3|backup4)"$' (as seen in the query inspector)

There are additional quotes around the regex.

Expected query: host:~'^(backup3|backup4)$'

Background Use Case:

I would like to filter logs based on multiple selected values, of e.g. the host label

The dashboard variables returns multiple values formatted regex (foo|bar|three). if using regexp filter host:~"$host" it partially works, but matches anything with that substring. also, regexp filter is not good for performance

Originally, the workaround with using regex start and end matchers worked, but later the quotes were introduced, now this doesn't work anymore. I would prefer that the quoting is left to the user of the variable and not returning quotes in the variable value. https://github.com/VictoriaMetrics/victorialogs-datasource/issues/65#issuecomment-2312876313

UPDATE: host:in($host) works correctly. Simplified this issue to be specific about the interpolation in regex.

pgassmann avatar Feb 17 '25 08:02 pgassmann

I just noticed, thanks to VictoriaMetrics/VictoriaMetrics#200, that there's more magic in the interpolation. Is there some documentation about this?

host:in($host) actually works correctly and is resolved to host:in(backup3,backup4,dev)

host:=$host is interpreted in yet another way, it is transformed to host:=backup3 OR backup4 OR dev

which is not correct, as reported in VictoriaMetrics/VictoriaMetrics#200

pgassmann avatar Feb 17 '25 16:02 pgassmann

Hey @pgassmann! Thanks for the request!

As I imagine typical dashboards, there should be the following matching cases:

  • field:in(foo, bar) - exact match of one or more words
  • field:~(foo, bar) - regex/substring match of one or more words
  • field!~(foo, bar) - regex/substring to exclude matched words

I don't know if field!in(foo, bar) will be ever needed, though.

So far, looks like we can already use these structures. For example, field:in($field_filter) is transformed by plugin into field:in(foo,bar) expression.

But a few things are missing:

  • arguments passed to () should be quoted: field:in("foo","bar"). This should make correct matching if $field_filter contains options with spaces: "foo bar", "baz"
  • arguments passed to () should be escaped: field:~("foo\.bar"). This should prevent from incorrect regex matching.
  • all constructions should continue working if $field_filter has no selected options. In other words: don't apply filter.

We should also discourage people from using field:$field_filter syntax if $field_filter is a multi-value variable. I think, we did this by mistake in response to this comment and it is better to stop supporting this sooner than later.

hagen1778 avatar Feb 26 '25 08:02 hagen1778

We should also discourage people from using field:$field_filter syntax if $field_filter is a multi-value variable.

I disagree. The most common case is when the $field_filter must exactly match one of the selected values. It would be great if the plugin could automatically expand the $field_filter into in("v1", ..., "vN") if at least one value selected (the values must be properly quoted into double quotes), and into (* or "") if no values are selected (this can be simplified into in(*) in the future when VictoriaLogs will support such a construct.

field:~(foo, bar) - regex/substring match of one or more words

This syntax isn't supported. You must use ~"foo|bar" instead, and the values must be properly escaped regarding regex syntax special chars such as . , *, [, etc. Also the final regex must be properly escaped inside double quotes.

valyala avatar Feb 26 '25 15:02 valyala

Also, I think it shouldn't be any difference with the extrapolation of the $field_filter for multi-value and single-value filters, since both filter types must exactly match the selected values. Note that the field:value filter matches any values, which contain value word at the field. For example, it matches foo value bar, value, foo value and value bar. IMHO, this is not what most users expect - they expect matching for the value value only, since it has been selected in the filter.

valyala avatar Feb 26 '25 16:02 valyala

FYI, the commit https://github.com/VictoriaMetrics/VictoriaMetrics/commit/84d5771b41cef5f8b80acd8c11b8546aadc5b0f7 adds support for in(*) filter into VictoriaLogs. This filter is treated as match all. This commit will be included in the next release of VictoriaLogs. After that the VictoraLogs datasource can replace the $field_filter with in(*) if all values were selected by the user for the given field.

valyala avatar Feb 27 '25 10:02 valyala

It would be great if the plugin could automatically expand the $field_filter into in("v1", ..., "vN") if at least one value selected

I don't like this idea. If $field_filter will always expand into in("v1", ..., "vN") construction then it would be impossible to use this variable with field:~$field_filter or field!~$field_filter syntax.

Thanks for adding https://github.com/VictoriaMetrics/VictoriaMetrics/commit/84d5771b41cef5f8b80acd8c11b8546aadc5b0f7. I suggest we can update variables parsing in a way that would satisfy the following truth table:

Syntax Variable Expansion Description
field:in($field_filter) "foo", "bar" field:in("foo", "bar") Select all fields that match exactly "foo" or "bar"
field:in($field_filter) *All field:in(*) Don't apply filtering
field:~$field_filter "foo", "bar" field:~"foo|bar" Select all fields that match regular expressions "foo" or "bar"
field:~"^$field_filter-.*" "foo" field:~"^foo-.*" Select all fields that match regular expression "^foo-.*". Quoting is not applied.
field:~$field_filter *All field:~".*" Don't apply filtering
field!~$field_filter "foo", "bar" field!~"foo|bar" Select all fields that don't match regular expressions "foo" or "bar"
field:=$field_filter "foo.*bar" field:="foo.*bar" Select all fields that exactly match "foo.*bar", without escaping
field:=$field_filter "foo", "bar" field:="foo|bar" A bad user input: using exact match := with multi-value variable. It will try to exact-match "foo|bar" string. Can we notify user about it?

*All is option to not apply filtering.

In implementation, the query parser should locate the variable and check substring before it:

  1. if substring before it is in(, then variable should be expanded into escaped comma-separated quoted list of items: "foo", "ba\|r".
  2. If the substring before it is ~, then the variable should be escaped quoted |-separated list of items: "foo|ba\|r".
  3. If the substring before it is ~", then expand variable as is: no quoting or escaping. Multi-value should be concatenated with | char.
  4. if substring before it is =, then variable should be expanded into non-escaped quoted | string: "foo|bar".

We should also discourage people from using field:$field_filter syntax if $field_filter is a multi-value variable.

See the latest example in the table above - this is one of the reasons why I want to discourage using := matcher. Especially it can be effectively replaced with field:in($field_filter) syntax.

@valyala please also see the difference for *All option table for field:in($field_filter) and field:~$field_filter. In Grafana, you need to specify custom All option for variables. Now, for in: case we need to specify * as custom value. But for :~ we 'd need to specify .* value. Can we resolve this inconsistency in logsql by allowing in(*) selector?

hagen1778 avatar Feb 27 '25 16:02 hagen1778

@hagen1778 in general, I agree with you. But I would leave the quoting to the user. especially for regex filters.

the user should write field:~"$field_filter"

This allows usecases like container_name:~"^$project-.*" which would expand to container_name:~"^project1-.*" or container_name:~"^(project1|project2)-.*" or for All: container_name:~"^.*-.*"

pgassmann avatar Feb 27 '25 18:02 pgassmann

and unless a custom value "*" for All is specified, the entire list should be passed, as the list could be filtered in the variables query

pgassmann avatar Feb 27 '25 18:02 pgassmann

I suggest we can update variables parsing in a way that would satisfy the following truth table:

The provided table looks mostly OK, except of the following things:

  • It contains invalid syntax for field!~regex - it must be written as field:!~regex, since all the filters on the given field must start with field:, e.g. they must contain field name followed by :.
  • It contains invalid filter for field:=$filter where $filter is expanded into multiple values. E.g. field:="foo|bar" filter matches the field, which equals to foo|bar value. The correct filter, which matches one of the selected values, must be field:in("foo", "bar").

Also the table is too complicated for the average user. As I understood, the majority of single-value and multi-value variables allow selecting values for some log field. Later these values must be used in strict equality filter aka in(...) in close to 100% cases. The = filter is a particular case of the in(...) filter, e.g. field:="foo" is fully equivalent to field:in("foo").

It is a mistake to use these values in other filter types such as word and phrase filter and regex filters aka ~ because of the following reasons:

  • The word filter and phrase filter matches field values, which contain the given word / phrase. For example, "foo" matches prefix foo, foo suffix and prefix foo suffix. This isn't what most users expect - they expect field values, which exactly match the selected variable values only.
  • The ~ filter matches values, which contain the provided regexp. For example, ~"foo" matches some_prefix_foo, foo_some_suffix and prefix_foo_suffix. I doubt this is what most users expect.
  • The ~ (regex) filter is the most slowest filter at LogsQL. Using it instead of in(..) filter results in much slower query performance in general case.

Given the reasons above, I'd recommend users to use field:$field_filter syntax in most cases, and discouraging using any other more error-prone and most likely invalid filters such as field:=$field_filter, field:~$field_filter, field:in($field_filter), etc. The VictoriaLogs datasource must expand the field:$field_filter into field:in("v1", ..., "vN"), where v1, v2, ..., vN are properly quoted selected values. If all values is selected, then field:in(*) must be used.

The quoting rule for the selected values is simple: just use json.Stringify(value) for each value before putting it into in(...) filter.

How to use $field_filter in negative filters?

Just put - in front of the field name: -field:$field_filter must be expanded into -field:in(...), which, in turn, selects logs without the selected values inside $field_filter.

How to use $field_filter with other filter types?

I'd completely disallow using $field_filter for single-value and multi-value variables in other filter types, since this is 100% mistake from the user's side, which will result in much slower query and may return unexpected results.

Escaping rules for regular expressions

This section isn't related to the text above. This section is added here because it looks like there is a confusion on how to properly escape values inside regular expressions (this concerns both VictoriaLogs and VictoriaMetrics plugins for Grafana, @hagen1778 ). For example, if we decide to allow ~$field_filter or ~"$field_filter" syntax, then the values inside $field_filter must be properly escaped in order to avoid clashing with all the special chars, which are supported by regular expression. These special chars include: ., *, |, [, {, -, (, ?, etc. The most straightforward way to do this is to use RegExp.escape function for every selected value, and then join the results with |. After that the resulting string must be properly quoted with JSON.stringify.

valyala avatar Feb 27 '25 22:02 valyala

FYI, the in(*) syntax is supported by VictoriaLogs starting from v1.15.0-victorialogs.

valyala avatar Feb 27 '25 23:02 valyala

and unless a custom value "*" for All is specified, the entire list should be passed, as the list could be filtered in the variables query

That is done automatically by Grafana.

the user should write field:~"$field_filter"

I've updated the table. Could you please verify it again?

hagen1778 avatar Feb 28 '25 07:02 hagen1778

@valyala the $field_filters are just dashboard variables that could be placed anywhere by the user (when building a dashboard). Using the right filters should be up to the user. Converting field:$variable to field:in($variable) would not be what I expect from the datasource. I didn't expect any magic interpolation, I assumed it would just return regex-syntax (value1|value2) as I was used from prometheus datasource. I don't know how other datasources handle variables and how this magic conversion can be communicated to the users in grafana.

@hagen1778 thanks. So you want to conditionally apply quoting. That means, you would need to detect if the variable is used within quotes and not apply quoting if you find a quote between field: and $variable.

Wouldn't it be better to leave the quoting entirely to the user?

  • field:$field_filter would be field:(foo OR bar)
  • field:in($field_filter) would be field:in(foo,bar)
  • field:~$field_filter would be field:~(foo|bar) => invalid, user should write field:~"$field_filter", within regex it should be extended to (foo|bar) not just foo|bar, the parentheses are missing in your table.
  • field:=$field_filter would be field:=(foo OR bar) => currently not valid, see my feature request

what if the individual values need quoting? should it be field:("foo 1" OR "bar 2")? but what if the user writes field:"$field_filter"

In your table, the negated regex query is not yet correct as pointed out by @valyala. the : is missing. field!~$field_filter should be field:!~$field_filter Then there's the the issue with the custom all value. if the variable is used with field:in($var), it needs to be set to *: in(*), if it is used in regex, it needs to be set to .*.

pgassmann avatar Feb 28 '25 10:02 pgassmann

What about middle ground proposal: we recommend users to use field:$field_filter or field:in($field_filter) syntax for filtering. This is indeed what 99% of cases would need:

User query Variable Expanded query Description
field:in($field_filter) "foo", "bar" field:in("foo", "bar") Select all fields that match exactly "foo" or "bar".
field:$field_filter "foo", "bar" field:in("foo", "bar") Alias for field:in($field_filter). Select all fields that match exactly "foo" or "bar".
-field:in($field_filter) "foo", "bar" -field:in("foo", "bar") Select all fields except those that match exactly "foo" or "bar"
-field:$field_filter "foo", "bar" -field:in("foo", "bar") Alias for -field:in($field_filter). Select all fields except those that match exactly "foo" or "bar"
field:in($field_filter) *All field:in(*) Don't apply filtering

I don't like having alias for in, as it could introduce more confusion. But this would make change backward compatible with existing syntax app:(buggy_app OR foobar), which is equal to app:in("buggy_app", "foobar").

Then, for the rest 1% case we can introduce the rules to expand a variable if it is not wrapped in in() construct:

User query Variable Expanded query
field:~$var "foo" field:~foo
field:~"^$var*" "foo-." field:~"^foo-.*"
field:~$var "foo", "bar" field:~"(foo|bar)"
field:~"$var" "foo", "bar" field:~"(foo|bar)"

Here, the variable is expanded like this despite the comparison operator. It behaves like this always, which simplifies implementation and understanding.

@valyala @pgassmann would that satisfy both cases: the most common filter type for 99% of cases, and regex filtering where $var always expands into "(option1|option2)"?


Custom All variable

It is probably possible to support .* inside in() operator. But even without this user should understand how the $var will be used: with :in or with :~. And set custom All value to * or .* correspondingly.

Escaping rules for regular expressions

I don't see why we need escaping. For in() operator we don't need it. For :~ operator it seems like user intentionally wants to use regexes inside vars, so no escaping is needed.

hagen1778 avatar Mar 03 '25 11:03 hagen1778

your proposal changes the meaning of the word filter syntax to be the same as exact match.

field:$field_filter should not be the same as field:in($field_filter)

app:(buggy_app OR foobar), which is equal to app:in("buggy_app", "foobar").

that's not how I understand it. the first is a word match for multiple words, the second is an exact match.

field could be _msg or any longer content field. and the variable could be a word to look for like ERROR, INFO, DEBUG.

the user could build a query like: _msg:$loglevel, meaning the log line should contain any of the selected words. why should this be treated as an exact match? IMO this should be expanded to _msg:(ERROR OR INFO OR DEBUG)

I would propose the simple rule:

  • if it is after ~ (regex), transform to (foo|bar),
  • if in in(), use comma separated list ("foo","bar")
  • everywhere else use OR separated list ("foo" OR "bar")

pgassmann avatar Mar 03 '25 12:03 pgassmann

The conversation above shows that it is too hard (almost impossible to an average user) to put the $var into the correct context inside LogsQL query, since even VictoriaLogs developers and experienced users have invalid assumptions on how the particular LogsQL filters work. Incorrectly placed $var may result in the following issues:

  • Unexpected results. For example, if the user writes field:~$var or field:$var instead of field:in($var) where the $var contains values for some log field, then unexpected results may be returned:
    • field:~"foo|bar" returns logs, which contain foo or bar substrings inside the field. For example, it matches foo, aaafoo, fooaaa, aaafoobar, abara, etc.
    • field:("foo" OR "bar") returns logs, which contain foo or bar words inside the field. For example, it matches foo, aaa foo, foo aaa, aaa foo bar, a bar a, etc.
  • Significant query slowdown. For example, field:~$var works much slower (1000x slower and more) than field:in($var) for fields containing non-constant values, since it cannot use bloom filters for skipping data blocks without the given values, and the regular expression matching is usually much slower than exact string matching (especially if some special chars such as . aren't escaped in the variable).

In the most cases users expect logs where the field value exactly matches the foo or bar, e.g. field:in("foo", "bar"). That's why I recommend expanding field:$var into field:in("v1", ..., "vN"), since it correctly covers the majority of cases. Important note: the v1, ..., vN must be properly escaped before putting into doublequotes. The best approach it to use JSON.stringify() per each value - it properly puts the value into doublequotes with the proper escaping. For example, if the value equals to foo"bar\w, then JSON.stringify() properly returns "foo\"bar\\w".

The =$var and !=$var contexts must be left for backwards compatibility. They must be expanded into in("v1", ..., "vN") and !in("v1", ..., "vN").

Another useful context for variables interpolation is some_func($var). It must be expanded into some_func("v1", ..., "vN"). It covers in($var) and contains_any($var).

Interpolation of the $var in all the other contexts (such as field:~$var) must be disallowed, since they will be used improperly instead of field:$var or field:in($var) in the majority of cases. The plugin must show an error to the user with the link to docs on how to use interpolation. The docs must recommend using field:$var and must mention field:contains_any($var) as an alternative to field:("v1" or ... or "vN").

The third important context, which must be supported for the variable interpolation is inside the stream filter. I'd allow only the following cases there:

  • {field in ($var)} - it must be expanded into {field in ("v1", ..., "vN")}
  • {field not_in ($var)} - it must be expanded into {field not_in ("v1", ..., "vN")}

All the other cases for stream filter such as {field="$var"} and {field=~"$var"} must be disallowed, since they will be used improperly in the majority of cases, and they can be easily replaced by {field in ($var)}.

Summary

  • :$var, e.g. : followed by $var must be expanded into :in("v1", ..., "vN") with proper doublequoting of every value via JSON.stringify().
  • =$var, e.g. = followed by $var must be expanded into in("v1", ..., "vN") with proper doublequoting of every value via JSON.stringify()
  • some_func($var), e.g. some_func( followed by $var followed by ) must be expanded into some_func("v1", ..., "vN") with proper double quoting of every value via JSON.stringify()
  • in ($var), e.g. in followed by whitespace with ( followed by $var followed by ) must be expanded into in ("v1", ..., "vN") with proper double quoting of every value via JSON.stringify().
  • not_in ($var), e.g. not_in followed by whitespace with ( followed by $var followed by ) must be expanded into not_in ("v1", ..., "vN") with proper double quoting of every value via JSON.stringify().

All the other contexts must be disallowed for variable interpolation. This primarily concerns the ~$var and ~"$var", since it is incorrectly interpolated (without proper escaping) and leads to very slow queries in the most cases.

valyala avatar Mar 16 '25 02:03 valyala

I will summarize the decision for this issue, correct me if I'm wrong.

We'll adopt a strict, predictable default for variable interpolation. Wherever a variable represents field values, expand field:$var and field=$var to field:in("v1", ..., "vN") with proper JSON‑style quoting/escaping; treat empty/All as in(*). Keep function contexts like in($var) and contains_any($var) expanding to quoted lists, and map field=$var / field!=$var to field:in(...) / !field:in(...) for compatibility. In stream filters, allow {tag in ($var)} and {tag not_in ($var)}. Disallow interpolation in other contexts (e.g., field:~$var) and show a helpful error with a reference link.

cc @Loori-R

If regex alternation is needed in the future, we could provide an explicit opt-in (e.g., regex_any($var)) or a "regex variable" flag.

func25 avatar Sep 04 '25 03:09 func25

PR with the fix was merged and will be included to the next release

arturminchukov avatar Oct 29 '25 07:10 arturminchukov

@pgassmann @valyala @func25 , The issue was fixed in the v0.21.4 plugin version. Close the issue as fixed. If you find any problems, feel free to reopen the issue

arturminchukov avatar Nov 11 '25 08:11 arturminchukov