Add support for `wildcard` field type

Open epiphone opened this issue 2 years ago • 14 comments

Elasticsearch added the wildcard field type in v7.9. Are there any plans to support the field type in OpenSearch?

Thanks!

epiphone avatar Dec 27 '22 08:12 epiphone

A clarification for others: Elasticsearch added the wildcard field type as an x-pack feature in version 7.9.0, so it is not an open-source feature. See the Elasticsearch 7.9 user guide: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/keyword.html#wildcard-field-type

tlfeng avatar Dec 27 '22 09:12 tlfeng

@epiphone Could you describe your use case? I do understand that it speeds up wildcard searches, but why is wildcard search performance critical in your application? How many unique values are in your dataset and how big are they? Are there any good workarounds?

macrakis avatar Jan 24 '23 16:01 macrakis

Some good use cases for wildcard:

  • Matching error messages and stack traces
  • Matching URL and file paths
  • Matching fields that have encoded content (e.g. "(8-10)||86128||Women's Apparel||...")

josefschiefer27 avatar Jan 28 '23 20:01 josefschiefer27

@macrakis my use case is a large index of user-submitted names where most names are short (<100 characters) and unique, and I want to query the names by arbitrary substrings.

As a workaround I'm using an ngram tokenizer which works well enough but is more complicated to set up than the wildcard field type.
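For readers unfamiliar with this workaround, here is a minimal sketch of what an ngram-based setup might look like. It is not the poster's actual configuration: the names `trigram_tokenizer`, `trigram_analyzer`, and the field `name` are hypothetical, and the gram size of 3 is an assumption. The two helper functions only imitate the idea in plain Python to show why substring lookups become possible.

```python
import json
from typing import List

# Hypothetical index body: an "ngram" tokenizer emitting 3-character grams.
# All names below are illustrative, not taken from the thread.
NGRAM_INDEX_BODY = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {"name": {"type": "text", "analyzer": "trigram_analyzer"}}
    },
}


def ngrams(text: str, n: int = 3) -> List[str]:
    """Return every lowercase n-character window of text."""
    lowered = text.lower()
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]


def may_contain(indexed: str, needle: str, n: int = 3) -> bool:
    """True if every n-gram of needle also occurs among the n-grams of indexed.

    This is a necessary condition for a substring match, not a sufficient one,
    and it is vacuously True for needles shorter than n characters.
    """
    return set(ngrams(needle, n)) <= set(ngrams(indexed, n))


if __name__ == "__main__":
    print(json.dumps(NGRAM_INDEX_BODY, indent=2))
    print(may_contain("Leclerc", "ecle"))
```

The extra analyzer, tokenizer, and gram-size choices are the setup overhead that a dedicated field type would hide.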

epiphone avatar Jan 30 '23 07:01 epiphone

Josef, Epiphone, thanks for your answers -- very helpful!

So it sounds like you need to find arbitrary substrings in your corpus, not just strings starting at token boundaries.

That would be not just "clerc" in "Leclerc", but also "ecle", not just "org/open" in "server/src/main/java/org/opensearch/index/query" but also "ense" in that pathname.
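To make the distinction concrete, here is a small sketch in plain Python. It assumes a simplified tokenizer that splits on non-alphanumeric characters, which is an approximation and not the behavior of any particular OpenSearch analyzer.

```python
import re


def token_prefix_match(text: str, needle: str) -> bool:
    """True if some token of text starts with needle (match at a token boundary)."""
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return any(t.lower().startswith(needle.lower()) for t in tokens)


def substring_match(text: str, needle: str) -> bool:
    """True if needle occurs anywhere in text, regardless of token boundaries."""
    return needle.lower() in text.lower()


PATH = "server/src/main/java/org/opensearch/index/query"

if __name__ == "__main__":
    print(token_prefix_match(PATH, "open"))  # starts the token "opensearch"
    print(token_prefix_match(PATH, "ense"))  # sits in the middle of a token
    print(substring_match(PATH, "ense"))     # arbitrary substring lookup
```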

Could the problems be solved by different tokenization?

macrakis avatar Jan 30 '23 21:01 macrakis

What makes the 'wildcard' data-type nice is that it is optimized for fields with large values or high cardinality for wildcard and regexp queries without changing the search experiences (e.g. searching via *ense*) and without worrying about tokenization.
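As a rough illustration of that search experience, the sketch below builds the request bodies as Elasticsearch documents them: a mapping with `"type": "wildcard"` and an ordinary `wildcard` query. The field name `message` is hypothetical, and at the time of this comment OpenSearch would reject the mapping because the type is not supported.

```python
import json


def build_wildcard_mapping(field: str) -> dict:
    """Mapping body declaring one field of type "wildcard" (Elasticsearch syntax)."""
    return {"mappings": {"properties": {field: {"type": "wildcard"}}}}


def build_wildcard_query(field: str, pattern: str) -> dict:
    """Standard wildcard query body; the pattern keeps its usual * and ? syntax."""
    return {"query": {"wildcard": {field: {"value": pattern}}}}


if __name__ == "__main__":
    # "message" is an illustrative field name, not one from the thread.
    print(json.dumps(build_wildcard_mapping("message"), indent=2))
    print(json.dumps(build_wildcard_query("message", "*ense*"), indent=2))
```

The query body is the same one that would be sent against a keyword field, which is the point being made: only the mapping changes.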

josefschiefer27 avatar Jan 31 '23 05:01 josefschiefer27

We have a use case for this where we need to index large XML documents that are > 32766 bytes. Our users want to be able to search for a string in an XML document, e.g. *failuremessage122* or just *failure* or even *fail*. A keyword field type would make sense (despite the poor leading wildcard performance), but this is not possible because the documents exceed the 32766 byte limit.
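A quick way to see which documents hit that limit is to measure their UTF-8 encoded size before indexing. This is only a sketch: the constant comes from the figure quoted above, and the assumption that the limit applies to the UTF-8 byte length of the whole value is mine.

```python
MAX_TERM_BYTES = 32766  # limit quoted in the comment above


def utf8_len(value: str) -> int:
    """Size of value in bytes once encoded as UTF-8."""
    return len(value.encode("utf-8"))


def fits_in_keyword(value: str) -> bool:
    """True if value stays within the quoted byte limit for a single term."""
    return utf8_len(value) <= MAX_TERM_BYTES


if __name__ == "__main__":
    small_doc = "<failure>failuremessage122</failure>"
    large_doc = "<root>" + "<item>failure</item>" * 5000 + "</root>"
    print(utf8_len(small_doc), fits_in_keyword(small_doc))
    print(utf8_len(large_doc), fits_in_keyword(large_doc))
```

Note that the check is on bytes, not characters, so text with many non-ASCII characters reaches the limit sooner.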

Tokenisation is also problematic with XML docs and we also get issues where we have token explosion with > 10000 terms generated when using most of the analyzers.

The XML logstash filter was considered but has similar issues with large documents producing a huge amount of fields. We don't always know ahead of time which elements we need to search for so that pre-processing of data isn't really an option.

Support for a "wildcard" field type would really improve our user experience

stevesimpson418 avatar Feb 08 '23 10:02 stevesimpson418

The wildcard field type has been supported sans x-pack since Elasticsearch 7.11: https://www.elastic.co/guide/en/elasticsearch/reference/7.11/keyword.html#wildcard-field-type. Any plans to support it in OpenSearch?

vindurriel avatar Feb 22 '23 08:02 vindurriel

AFAIK nobody is working on this.

If someone wants to give it a shot, there are folks contributing the flattened field type via https://github.com/opensearch-project/OpenSearch/issues/1018, and it looks like there’s a draft PR in https://github.com/opensearch-project/OpenSearch/pull/6507 that can be used as inspiration.

Please note that we cannot accept any code from ES > 7.10.2, which was the last version under APLv2. Would welcome an independent implementation that doesn't look at anything under an incompatible license.

dblock avatar Mar 03 '23 19:03 dblock

Re using wildcard field for XML (https://github.com/opensearch-project/OpenSearch/issues/5639#issuecomment-1422345013), I wonder if you could use the XML logstash filter and then the Flat field type which is coming out in 2.7? (https://github.com/opensearch-project/OpenSearch/issues/1018#issuecomment-1188365805).

macrakis avatar Mar 30 '23 01:03 macrakis

@macrakis I've had a look through the docs for a Flat field type and without ruling it out completely I'd have some concerns:

  • Flat fields are useful when a field and its subfields will mostly be read, and not be used as search criteria.
  • Performance should be similar to a keyword field. I imagine this will be awful for wildcard searches where we need to search for "foo" OR "bar" in the XML.

That being said I'd be willing to give it a try in our development environments when this feature is released

stevesimpson418 avatar Mar 31 '23 12:03 stevesimpson418

There was some good discussion over on https://github.com/opensearch-project/OpenSearch/issues/12500, which highlighted the value of wildcard fields.

Also, Elastic's blog post about the feature provides a really good explanation: https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field

msfroh avatar Feb 29 '24 22:02 msfroh

Another reason to implement it: if you want to use ECS 8.12, it's used in the standard component templates. Trying to load them:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"No handler for type [wildcard] declared on field [content]"}],"type":"mapper_parsing_exception","reason":"Failed to parse mapping [_doc]: No handler for type [wildcard] declared on field [content]","caused_by":{"type":"mapper_parsing_exception","reason":"No handler for type [wildcard] declared on field [content]"}},"status":400}
_component_template/ecs_8.0.0_http
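To find out in advance which fields in a template would trigger this error, one could walk the mapping and list every field declared with the unsupported type. This is a sketch only: the sample template fragment below is made up for illustration and is not copied from the ECS component templates.

```python
from typing import Dict, List


def find_fields_of_type(properties: Dict, wanted: str = "wildcard", prefix: str = "") -> List[str]:
    """Return dotted paths of all fields whose "type" equals wanted."""
    found = []
    for name, spec in properties.items():
        path = prefix + name
        if spec.get("type") == wanted:
            found.append(path)
        # Recurse into object sub-properties and multi-fields.
        for key in ("properties", "fields"):
            if key in spec:
                found.extend(find_fields_of_type(spec[key], wanted, path + "."))
    return found


# Hypothetical mapping fragment, shaped like a component template's mappings.
SAMPLE_PROPERTIES = {
    "http": {
        "properties": {
            "request": {
                "properties": {
                    "body": {"properties": {"content": {"type": "wildcard"}}}
                }
            }
        }
    },
    "url": {
        "properties": {
            "full": {"type": "wildcard"},
            "path": {"type": "keyword"},
        }
    },
}

if __name__ == "__main__":
    for field in find_fields_of_type(SAMPLE_PROPERTIES):
        print(field)
```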

sandervandegeijn avatar Apr 03 '24 21:04 sandervandegeijn

> What makes the 'wildcard' data-type nice is that it is optimized for fields with large values or high cardinality for wildcard and regexp queries without changing the search experiences (e.g. searching via *ense*) and without worrying about tokenization.

This is similar to our use case. We are storing large JSON objects (log data) where the JSON keys are not known in advance. We are using flat_object for this but cannot store values larger than 32 KB. The wildcard type allows for values > 32 KB and would save us from having to drop fields > 32 KB before indexing.

stowns avatar May 10 '24 14:05 stowns

There is a draft PR out now: https://github.com/opensearch-project/OpenSearch/pull/13461#issue-2270363868

getsaurabh02 avatar May 21 '24 16:05 getsaurabh02

Fantastic, thanks!

sandervandegeijn avatar Jun 11 '24 09:06 sandervandegeijn