
Dynamic Mapping should not tokenize values

Open machete-michael opened this issue 2 years ago • 4 comments

I’m using dynamic mapping to ingest a JSON object that has a field containing an array of JSON objects. If an array element has a field whose value contains delimiters such as -, _, or # (e.g. a uuid), querying against this field results in the error:

"SplitSearchError { error: \"Invalid query: The field '_dynamic' does not have positions indexed\"…}"

Steps to reproduce:

  1. Set up the index config to use dynamic mode.
  2. Ingest an object with an array of objects that has a field mapped to a uuid.
  3. Execute a term query matching a uuid.

Expected behavior: I should get a matching document as a response.

Configuration:

  1. Output of quickwit --version: v0.3.1nightly
  2. The index_config.yaml:

version: 0
index_id: foo
doc_mapping:
  mode: dynamic
  field_mappings:
    - name: id
      type: text
      tokenizer: raw

machete-michael avatar Aug 02 '22 22:08 machete-michael

Thanks @machete-michael for the report.

More info on this issue.

Here is a request made on a default dynamic mapping (see docs example) that shows the same error:

curl -XGET http://localhost:7280/api/v1/my_dynamic_index/search\?query\=cart.product_description:cherry-pi
{
  "InvalidQuery": "The field '_dynamic' does not have positions indexed"
}

Without the - character, everything works well. Somehow, adding - triggers a phrase query. But the cause could be something totally different (like something happening in tantivy's generate_literals_for_json_object function). We need to investigate what's happening.

fmassot avatar Aug 02 '22 23:08 fmassot

The query parser identifies the string to search for correctly. The default tokenizer then splits it into several tokens ([cherry, pi]), which triggers the phrase query.

Probably the right fix would be to emit an intersection query here if positions are not available, instead of emitting an error.
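The tokenization step described above can be illustrated with a small sketch. It is a simplified stand-in, not tantivy's or Quickwit's actual code: `simple_tokenize` and `query_kind` are hypothetical helpers, and the splitting rule (lowercase, split on any non-alphanumeric character) is an assumption about how a default tokenizer behaves.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for a default tokenizer:
    lowercase the input and split on any non-alphanumeric character."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", text.lower()) if tok]


def query_kind(value: str) -> str:
    """A single token can be looked up with a term query.
    Several tokens form a phrase query, which needs token positions."""
    tokens = simple_tokenize(value)
    return "term" if len(tokens) <= 1 else "phrase"


for value in ["cherry", "cherry-pi", "8.8.8.8", "Hello World"]:
    print(value, simple_tokenize(value), query_kind(value))
```

Under these assumptions, any value containing a delimiter turns into a multi-token query even when the user never typed quotation marks, which matches the behavior reported in this thread.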

fulmicoton avatar Aug 19 '22 13:08 fulmicoton

@machete-michael sorry for the long silence. A fresh look at this issue made me think that you may be interested in a UUID-friendly tokenizer.

We have opened an issue on this: https://github.com/quickwit-oss/quickwit/issues/1143

There is a PR that is almost mergeable here too: https://github.com/quickwit-oss/quickwit/pull/1598

Is this something you are interested in?

fmassot avatar Sep 30 '22 09:09 fmassot

Hi @fmassot,

Thank you for looking into this issue.

A UUID-friendly tokenizer may only solve the issue for values with dashes, not the issues with the other delimiters.

In any case, I’ve moved on to other solutions and am not waiting for a fix.

Please feel free to close the issue.

machete-michael avatar Sep 30 '22 20:09 machete-michael

The query parser identifies the string to search for correctly. The default tokenizer then splits it into several tokens ([cherry, pi]), which triggers the phrase query.

Probably the right fix would be to emit an intersection query here if positions are not available, instead of emitting an error.

Shouldn't we use the same tokenizer as set in the config for the field ("raw")?

The PhraseQuery issue would still persist for fields that are tokenized. I'm not sure about an intersection query, since it may silently return wrong results.
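The concern about an intersection query can be made concrete with a toy comparison. This is only a sketch under simple assumptions (whitespace tokenization, a hypothetical two-document corpus), not how tantivy evaluates queries: an intersection only checks that all tokens occur in the document, while a phrase match also requires them to be adjacent and in order.

```python
def intersection_match(doc_tokens: list[str], query_tokens: list[str]) -> bool:
    """True if every query token occurs somewhere in the document."""
    return all(tok in doc_tokens for tok in query_tokens)


def phrase_match(doc_tokens: list[str], query_tokens: list[str]) -> bool:
    """True only if the query tokens occur consecutively and in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


# Hypothetical documents, for illustration only.
docs = {
    1: "cherry pi recipe",
    2: "pi day cherry sale",
}
query = ["cherry", "pi"]

intersection_hits = [d for d, text in docs.items()
                     if intersection_match(text.split(), query)]
phrase_hits = [d for d, text in docs.items()
               if phrase_match(text.split(), query)]

print("intersection:", intersection_hits)
print("phrase:", phrase_hits)
```

Document 2 contains both tokens but not the phrase, so an intersection fallback would return it without any indication that the match is weaker than what was asked for.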

PSeitz avatar Nov 28 '22 06:11 PSeitz

Hi there,

I've just spun up a fresh install of Quickwit 0.5 and configured a very simple index with no field mappings (pure dynamic mode). I'm ingesting JSON logs from Vector. In that configuration, I cannot search for anything containing the characters "-", ",", ".", or a space. I get the error "Invalid query: The field '_dynamic' does not have positions indexed" 100% of the time. It does not depend on the field I'm searching on. Most of the fields I've tried are supposed to be simple string fields.

Examples:

  • dns.rrname:"10,10"
  • dest_ip:"8.8.8.8"
  • string:"Hello World"

If I search for a term without these special characters, it works with no problem. This issue is quite problematic because almost all the searches I would like to run fail 🙁 I've tried deleting my index and restarting from scratch, with no success.

My index configuration:

version: 0.5
index_id: suricata
doc_mapping:
  mode: dynamic
indexing_settings:
  commit_timeout_secs: 10

Did I miss something in the setup/configuration?

peacand avatar Apr 24 '23 06:04 peacand

The queries you are trying to run are so-called phrase queries (due to the quotation marks). They require token positions to be stored in the index. This is not enabled by default, but you can enable it as follows:

version: 0.5
index_id: suricata
doc_mapping:
  mode: dynamic
  dynamic_mapping:
    record: position # defaults to basic
indexing_settings:
  commit_timeout_secs: 10
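Why the `record` setting matters can be sketched with a toy inverted index. This is an illustrative model only, not Quickwit's or tantivy's implementation: `build_index` and `phrase_search` are hypothetical helpers, and only the option names `basic` and `position` come from the config above.

```python
from collections import defaultdict


def build_index(docs: dict[int, str], record: str = "basic") -> dict:
    """Map token -> {doc_id: positions}. With record="basic" only the
    doc ids are kept (positions are None); with "position" they are stored."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.lower().split()):
            if record == "position":
                index[tok].setdefault(doc_id, []).append(pos)
            else:
                index[tok][doc_id] = None
    return index


def phrase_search(index: dict, tokens: list[str], record: str) -> list[int]:
    """Return doc ids where the tokens appear consecutively.
    Without stored positions this cannot be answered, so it raises."""
    if record != "position":
        raise ValueError("field does not have positions indexed")
    first, rest = tokens[0], tokens[1:]
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        for start in positions:
            # Each following token must sit at the next position in the same doc.
            if all(start + k + 1 in index.get(tok, {}).get(doc_id, [])
                   for k, tok in enumerate(rest)):
                hits.append(doc_id)
                break
    return hits


# Hypothetical documents, for illustration only.
docs = {1: "hello world", 2: "world hello"}
with_positions = build_index(docs, record="position")
print(phrase_search(with_positions, ["hello", "world"], "position"))
```

In this model, an index built with `record="basic"` can still answer single-term lookups, but a phrase search has nothing to compare positions against, which is the situation the error message describes.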

fulmicoton avatar Apr 24 '23 09:04 fulmicoton

Thank you @fulmicoton! I confirm it works perfectly!

peacand avatar Apr 24 '23 10:04 peacand