OpenSearch
Add support for `wildcard` field type
Elasticsearch added the wildcard field type in v7.9. Are there any plans to support the field type in OpenSearch?
Thanks!
A clarification for others: Elasticsearch added the wildcard field type as an X-Pack feature in version 7.9.0, so it is not an open-source feature. See the Elasticsearch 7.9 user guide: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/keyword.html#wildcard-field-type
@epiphone Could you describe your use case? I do understand that it speeds up wildcard searches, but why is wildcard search performance critical in your application? How many unique values are in your dataset and how big are they? Are there any good workarounds?
Some good use cases for wildcard:
- Matching error messages and stack traces
- Matching URL and file paths
- Matching fields that have encoded content (e.g. "(8-10)||86128||Women's Apparel||...")
@macrakis my use case is a large index of user-submitted names where most names are short (<100 characters) and unique, and I want to query the names by arbitrary substrings.
As a workaround I'm using an ngram tokenizer which works well enough but is more complicated to set up than the wildcard field type.
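For readers who want to see what the ngram workaround involves, here is a minimal sketch of an index body using the `ngram` tokenizer. The field name `name` and the names `substring_tokenizer` and `substring_analyzer` are made up for illustration, and the gram size of 3 is an assumption, not necessarily the setup described above.

```python
import json


def ngram_index_body(field="name", n=3):
    """Build an index-creation body that analyzes `field` into n-grams.

    All names here (field, tokenizer, analyzer) are hypothetical.
    """
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # Emit every run of exactly n characters as a token.
                    "substring_tokenizer": {
                        "type": "ngram",
                        "min_gram": n,
                        "max_gram": n,
                    }
                },
                "analyzer": {
                    "substring_analyzer": {
                        "type": "custom",
                        "tokenizer": "substring_tokenizer",
                        "filter": ["lowercase"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "substring_analyzer"}
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be sent when creating the index.
    print(json.dumps(ngram_index_body(), indent=2))
```

With a setup like this, a `match_phrase` query on the field behaves roughly like a substring search for inputs of at least `n` characters, which is the extra configuration a wildcard field type would make unnecessary.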
Josef, Epiphone, thanks for your answers -- very helpful!
So it sounds like you need to find arbitrary substrings in your corpus, not just strings starting at token boundaries.
That would be not just "clerc" in "Leclerc", but also "ecle", not just "org/open" in "server/src/main/java/org/opensearch/index/query" but also "ense" in that pathname.
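To make the distinction concrete, here is a small sketch contrasting a match that must start at a token boundary with a match anywhere in the string. The `tokens` function is a crude stand-in for an analyzer (lowercase, split on non-alphanumerics), not how OpenSearch actually tokenizes.

```python
import re


def tokens(text):
    # Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def matches_at_token_start(text, query):
    # True only if some token begins with the query (prefix-style matching).
    return any(t.startswith(query.lower()) for t in tokens(text))


def matches_anywhere(text, query):
    # True if the query occurs at any offset (wildcard-style *query* matching).
    return query.lower() in text.lower()


path = "server/src/main/java/org/opensearch/index/query"

if __name__ == "__main__":
    for q in ("opens", "ense"):
        print(q, matches_at_token_start(path, q), matches_anywhere(path, q))
```

Here "opens" is found by both approaches, while "ense" is only found by the arbitrary-substring match, because it starts in the middle of the token "opensearch".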
Could the problems be solved by different tokenization?
What makes the 'wildcard' data-type nice is that it is optimized for fields with large values or high cardinality for wildcard and regexp queries without changing the search experiences (e.g. searching via *ense*) and without worrying about tokenization.
We have a use case for this where we need to index large XML documents that are > 32766 bytes. Our users want to be able to search for a string in an XML document, e.g. *failuremessage122* or just *failure* or even *fail*. A keyword field type would make sense (despite the poor leading wildcard performance), but this is not possible because the documents exceed the 32766-byte limit.
Tokenisation is also problematic with XML docs and we also get issues where we have token explosion with > 10000 terms generated when using most of the analyzers.
The XML logstash filter was considered but has similar issues with large documents producing a huge amount of fields. We don't always know ahead of time which elements we need to search for so that pre-processing of data isn't really an option.
Support for a "wildcard" field type would really improve our user experience.
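One detail worth spelling out about the 32766 limit mentioned above: it is counted in bytes of the UTF-8 encoded term, not in characters. A minimal sketch of a pre-indexing check, where the helper name `fits_keyword_term` is hypothetical:

```python
# Maximum length of a single indexed term, in UTF-8 bytes (the limit cited above).
MAX_TERM_BYTES = 32766


def fits_keyword_term(value: str) -> bool:
    """Return True if `value` could be indexed as a single keyword term."""
    return len(value.encode("utf-8")) <= MAX_TERM_BYTES


if __name__ == "__main__":
    ascii_doc = "a" * 32766       # one byte per character
    accented_doc = "é" * 16384    # two bytes per character in UTF-8
    print(fits_keyword_term(ascii_doc), fits_keyword_term(accented_doc))
```

A document with non-ASCII content can therefore exceed the limit with far fewer than 32766 characters.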
The wildcard field type has been supported without X-Pack since Elasticsearch 7.11: https://www.elastic.co/guide/en/elasticsearch/reference/7.11/keyword.html#wildcard-field-type. Are there any plans to support it in OpenSearch?
AFAIK nobody is working on this.
If someone wants to give it a shot, there are folks contributing a flattened field type via https://github.com/opensearch-project/OpenSearch/issues/1018, and it looks like there's a draft PR in https://github.com/opensearch-project/OpenSearch/pull/6507 that can be used as inspiration.
Please note that we cannot accept any code from ES > 7.10.2, which was the last version under APLv2. Would welcome an independent implementation that doesn't look at anything under an incompatible license.
Re using wildcard field for XML (https://github.com/opensearch-project/OpenSearch/issues/5639#issuecomment-1422345013), I wonder if you could use the XML logstash filter and then the Flat field type which is coming out in 2.7? (https://github.com/opensearch-project/OpenSearch/issues/1018#issuecomment-1188365805).
@macrakis I've had a look through the docs for a Flat field type and without ruling it out completely I'd have some concerns:
- Flat fields are useful when a field and its subfields will mostly be read, and not be used as search criteria.
- Performance should be similar to a keyword field. I imagine this will be awful for wildcard searches where we need to search for "foo" OR "bar" in the XML.
That being said, I'd be willing to give it a try in our development environments when this feature is released.
There was some good discussion over on https://github.com/opensearch-project/OpenSearch/issues/12500, which highlighted the value of wildcard fields.
Also, Elastic's blog post about the feature provides a really good explanation: https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field
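As a rough illustration of the general idea behind this kind of field (an n-gram index to narrow down candidates, followed by a check against the full stored value), here is a toy sketch. It is an independent simplification written for this thread, not the Elasticsearch implementation and not code from any OpenSearch PR; the class and method names are made up.

```python
from collections import defaultdict


def trigrams(s):
    # All runs of three consecutive characters in s.
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TrigramIndex:
    """Toy substring index: trigram postings plus the original values."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of doc ids
        self.values = {}                  # doc id -> original value

    def add(self, doc_id, value):
        self.values[doc_id] = value
        for gram in trigrams(value.lower()):
            self.postings[gram].add(doc_id)

    def search_substring(self, needle):
        needle = needle.lower()
        # Step 1: candidates must contain every trigram of the needle.
        # Needles shorter than 3 characters yield no trigrams, so all
        # documents remain candidates and step 2 does the real work.
        candidates = set(self.values)
        for gram in trigrams(needle):
            candidates &= self.postings.get(gram, set())
        # Step 2: verify against the full value to drop false positives.
        return sorted(d for d in candidates if needle in self.values[d].lower())


if __name__ == "__main__":
    index = TrigramIndex()
    index.add(1, "Leclerc")
    index.add(2, "server/src/main/java/org/opensearch/index/query")
    index.add(3, "ensxnse")  # has trigrams "ens" and "nse" but not "ense"
    print(index.search_substring("ense"))
```

The verification step matters: document 3 shares every trigram with the query "ense" but does not contain it, so the trigram lookup alone would return a false positive.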
Another reason to implement it: if you want to use ECS 8.12, it's used in the standard component templates. Trying to load them:
{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"No handler for type [wildcard] declared on field [content]"}],"type":"mapper_parsing_exception","reason":"Failed to parse mapping [_doc]: No handler for type [wildcard] declared on field [content]","caused_by":{"type":"mapper_parsing_exception","reason":"No handler for type [wildcard] declared on field [content]"}},"status":400}
_component_template/ecs_8.0.0_http
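Until the type is available, one possible stopgap for loading such templates is to rewrite `wildcard` fields to another type before sending the mapping. The sketch below is a hypothetical helper, not something suggested in this thread or shipped with ECS, and falling back to `keyword` gives up the wildcard optimizations and brings back the keyword size limit.

```python
import copy


def replace_wildcard_type(mapping, fallback="keyword"):
    """Return a copy of `mapping` with every "type": "wildcard" rewritten.

    Hypothetical helper; `fallback` is an assumption, pick what suits the data.
    """
    result = copy.deepcopy(mapping)

    def walk(node):
        if isinstance(node, dict):
            if node.get("type") == "wildcard":
                node["type"] = fallback
            for child in node.values():
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)

    walk(result)
    return result


if __name__ == "__main__":
    # Minimal mapping shaped like the failing one: field [content] of type wildcard.
    template = {"mappings": {"properties": {"content": {"type": "wildcard"}}}}
    print(replace_wildcard_type(template))
```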
> What makes the 'wildcard' data-type nice is that it is optimized for fields with large values or high cardinality for wildcard and regexp queries without changing the search experiences (e.g. searching via *ense*) and without worrying about tokenization.
This is similar to our use-case. We are storing large json objects (log data) where the json keys are not known in advance. We are using flat_object for this but cannot store values larger than 32kb. The wildcard type allows for values > 32kb and would save us from having to drop fields > 32kb before indexing.
There is a draft PR out now: https://github.com/opensearch-project/OpenSearch/pull/13461#issue-2270363868
Fantastic, thanks!