Special characters in searching feature with Tika
## Describe the bug
A file contains special characters such as `ñíôùë*+ç$%&@?`. When the iOS client performs a server-side search over the content using one of these special characters, the server returns empty or incomplete results. Some results appear to be incorrect.
## Steps to reproduce
Create a txt file called `sum.txt` with the following content:

```
one+one=two
four+four=eight
```
Using the iOS client, search by content:
| typed string in client | returned | Passed? | comment |
|---|---|---|---|
| one | no result | ❌ | one exists twice |
| four | sum.txt | ✅ | |
| + | no result | ❌ | + exists |
| = | no result | ❌ | = exists |
| on | sum.txt | ✅ | |
| one+one | no result | ❌ | one+one fits the content |
| four+four | no result | ❌ | four+four fits the content |
There are more failing examples with the other characters mentioned above.
This is the curl command you can use to reproduce the issue:
```shell
curl -X REPORT 'https://xx.xx.xx.xx:9200/remote.php/dav/spaces' \
  -H 'Host: xx.xx.xx.xx:9200' \
  -H 'Original-Request-ID: B88FCC21-5664-4706-96A3-C3C2B314770B' --compressed \
  -H 'Connection: keep-alive' \
  -H 'User-Agent: ownCloudApp/12.4.0 (App/296; iOS/18.2; iPhone)' \
  -H 'Accept-Language: en' \
  -H 'Authorization: Bearer ...' \
  -d '<?xml version="1.0" encoding="UTF-8"?>
<oc:search-files xmlns:D="DAV:" xmlns:oc="http://owncloud.org/ns">
  <D:prop>
    <D:resourcetype/>
    <D:getlastmodified/>
    <D:getcontentlength/>
    <D:getcontenttype/>
    <D:getetag/>
    <oc:id/>
    <oc:size/>
    <oc:permissions/>
    <oc:favorite/>
    <oc:share-types/>
    <oc:owner-id/>
    <oc:owner-display-name/>
  </D:prop>
  <oc:search>
    <oc:pattern>(content:"*one*")</oc:pattern>
    <oc:limit>100</oc:limit>
  </oc:search>
</oc:search-files>'
```
Replace `*one*` in the `oc:pattern` property to try different searches.
## Setup
Created a docker container with server-side search enabled, using the following `docker-compose.yml` file:
```yaml
version: "3.7"
services:
  ocis:
    image: owncloud/ocis:7.0.1
    ports:
      - "9200:9200"
      - "9215:9215"
    environment:
      OCIS_INSECURE: "true"
      OCIS_URL: "https://IP:9200"
      IDM_CREATE_DEMO_USERS: "true"
      IDM_ADMIN_PASSWORD: "admin"
      PROXY_ENABLE_BASIC_AUTH: "true"
      OCIS_SERVICE_ACCOUNT_ID: "b0fbfad7-3dd6-49cb-b468-3f588f2f82be"
      OCIS_SERVICE_ACCOUNT_SECRET: "asaGE4DF"
      SEARCH_EXTRACTOR_TYPE: tika
      SEARCH_EXTRACTOR_TIKA_TIKA_URL: "http://tika:9998"
      FRONTEND_FULL_TEXT_SEARCH_ENABLED: "true"
    restart: "no"
    entrypoint: ["/bin/sh"]
    command: ["-c", "ocis init || true; ocis server"]
    networks:
      - ocis-net
  tika:
    image: apache/tika:2.9.0.0-full
    restart: "always"
    networks:
      - ocis-net
networks:
  ocis-net:
```
## Additional context
OTOH, using a date for a more accurate search does not work either:

```xml
<oc:search>
  <oc:pattern>((mtime:>2024-12-31T23:00:00+00:00) AND (name:"*Joe*"))</oc:pattern>
  <oc:limit>100</oc:limit>
</oc:search>
```

returns an empty set of values.
A colon `:` is not a delimiter between the property and value; it is an operator. The `:` doesn't make sense combined with `>`, `<`, or `=`.
https://learn.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference#property-operators-that-are-supported-in-property-restrictions
cc @felix-schwarz :point_up_2:
We use KQL on top of bleve. KQL supports prefix matching with phrases specified in property values, but you must use the wildcard operator `*` in the query, and it is supported only at the end of the phrase.
- Searching by `+` or `=` returns `No matches`. `one+one`, `one one`, `one`, and `on*` work fine; `one+*` and `*one*` don't work. I'm assuming that the text `one+one` is indexed as 2 separate terms in the index and the wildcard is not working in this case. https://github.com/blevesearch/bleve/issues/1433#issuecomment-667349811
@2403905
> A colon `:` is not a delimiter between the property and value, it is an operator. The `:` doesn't make sense with `>` `<` `=`
>
> https://learn.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference#property-operators-that-are-supported-in-property-restrictions
IIRC, back when I implemented the feature, I used the KQL reference at first but didn't get any results back from the server. (Also IIRC) it started working once I used `:<` and `:>` instead of `<` and `>`.
I'd love to verify what I remember and either share request/response pairs - or change the respective code, but I lack a test instance (demo.owncloud.com has broken search, ocis.*.master.owncloud.works are all offline for a while now).
Update: Using https://owncloud.dev/ocis/deployment/ocis_full/ I set up a test instance, which appears to have working Tika integration. I'll post an update once I'm through with testing…
I now had a chance to test with the local test instance and check if the generated KQL syntax works with the current server version. I found that
- it really needed to be `:<`/`:>` to work back when I implemented it
- it only works now using the KQL-correct `<` and `>`

and fixed that with two PRs in the SDK and app respectively.
It would be good to add better query validation messages on the ocis side.
About the described problems:

- Date comparison and filtering are OK after the fix from @felix-schwarz.
- About the prefix matching:

> search by `+` or `=` returns `No matches`

Is there any list of banned characters (or similar) somewhere that can only be matched via a wildcard?
Here is a bleve doc https://blevesearch.com/docs/Query-String-Query/
We escape the characters `+-=&|><!(){}[]^\"~*?:\\/`, and search works fine for `contains\ a\"\ character`, but for some reason doesn't work for `Content:\=\=` or `Content:\+\+` with the sample below:
```
one+one=two
four+four=eight
++
==dsdss five
contains a" character
```
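For illustration, the kind of backslash-escaping described above can be sketched in Go. Note that `escapeBleve` is a hypothetical helper written for this example; it is not the actual function from `services/search/pkg/query/bleve/compiler.go`:

```go
package main

import (
	"fmt"
	"strings"
)

// escapeBleve prefixes bleve query-string special characters with a
// backslash. The character set mirrors the one mentioned above;
// this is an illustrative sketch, not the real implementation.
func escapeBleve(s string) string {
	const special = `+-=&|><!(){}[]^"~*?:\/`
	var b strings.Builder
	for _, r := range s {
		if strings.ContainsRune(special, r) {
			b.WriteRune('\\')
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(escapeBleve(`contains a" character`)) // contains a\" character
	fmt.Println(escapeBleve("one+one=two"))           // one\+one\=two
}
```

Even with this escaping applied, whether the escaped term matches anything still depends on how the analyzer tokenized the content at index time, which is what the rest of this thread explores.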
It looks like we should avoid using `+`, `_`, `*`, and `.`, but that is just an assumption.
Searching for `Content:ñíôù*` also works fine in a sample:

```
ñíôùë*+ç$%&@? char ùësd
```

but `Content:ç$%&@`, `Content:ç\$\%\&\@` or `Content:ç\$\%` returns only one match.
I don't see any explanation of how the index mapping and analyzer work in https://blevesearch.com/docs/Index-Mapping/
So it's not totally clear how it works under the hood. I've checked with different special characters and it works fine with no escaping. For a file with the following short content:
```
ñíôùë*+ç$%&@?
```
Searches:

- `ñ` -> ✅
- `ñí` -> ✅
- `ñíô` -> ✅
- `ñíôù` -> ✅
- `ñíôùë` -> ✅
- `ñíôùë*` -> ✅
- `ñíôùë*+` -> ❌ (because of the `+`, as stated)
- `ç` -> ✅
- `ç$` -> ❌
- `@` -> ❌
(these are the typed strings in the iOS app; the request to the server encloses them between wildcards)
The fact is, I found no characters escaped in the request; not sure if this affects anything.
OTOH, it may be something related to my test server, but it searches only up to 8 characters. The word demonstration is in one of my files:
Searching by:

- `d`, `de`, ... `demonstr` (8 chars) are ✅, the file is listed
- `demonstra` (and longer, like `demonstration`) ❌, not listed
I found nothing in the docs about a limit on the number of characters to search by.
Thanks for your support.
@jesmrec Sorry if I confused you. You shouldn't care about escaping, the backend is doing it. https://github.com/owncloud/ocis/blob/c2736dfb471619f517ebb6759ebf9502d9340040/services/search/pkg/query/bleve/compiler.go#L31
I used the Bleve CLI search responses in the screenshots to show that the problem is not on the search-pattern side, because in the current implementation we parse the pattern into a KQL AST and then build the Bleve request.
I'm not sure why we have a "KQL->bleve" translator, which is a big point of failure, when bleve alone (and by extension any search engine, such as elasticsearch) is already very complex to understand and debug...
Having said that, I'll focus my explanation on bleve, using the bleve command line, so I'll skip any KQL to bleve translation being made (if any)
The default mapping configuration for the bleve index is below. The mapping is built in https://github.com/owncloud/ocis/blob/master/services/search/pkg/engine/bleve.go#L73, which results in the following:
```json
{
  "default_mapping": {
    "enabled": true,
    "dynamic": true,
    "properties": {
      "Content": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "text",
            "analyzer": "fulltext",
            "store": true,
            "index": true,
            "include_term_vectors": true,
            "docvalues": true
          }
        ]
      },
      "Name": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "text",
            "analyzer": "lowercaseKeyword",
            "store": true,
            "index": true,
            "include_term_vectors": true,
            "include_in_all": true,
            "docvalues": true
          }
        ]
      },
      "Tags": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "text",
            "analyzer": "lowercaseKeyword",
            "store": true,
            "index": true,
            "include_term_vectors": true,
            "docvalues": true
          }
        ]
      }
    }
  },
  "type_field": "_type",
  "default_type": "_default",
  "default_analyzer": "keyword",
  "default_datetime_parser": "dateTimeOptional",
  "default_field": "_all",
  "store_dynamic": true,
  "index_dynamic": true,
  "docvalues_dynamic": true,
  "analysis": {
    "analyzers": {
      "fulltext": {
        "token_filters": [
          "to_lower",
          "stemmer_porter"
        ],
        "tokenizer": "unicode",
        "type": "custom"
      },
      "lowercaseKeyword": {
        "token_filters": [
          "to_lower"
        ],
        "tokenizer": "single",
        "type": "custom"
      }
    }
  }
}
```
If no field is provided in the search, the `_all` field will be used (basically an `_all:searchTerm` search), which uses the keyword analyzer. The "keyword" analyzer stores the input without changes. This is used for exact matches. Wildcards can be used.
For example, assuming you upload a file called "sum.txt":
- `sum.txt` -> finds it
- `sum*` -> finds it
- `Sum*` -> does NOT find it. This is equivalent to `_all:Sum*`, and there is no lowercase conversion for the `_all` field
- `Sum.txt` -> does NOT find it. `_all:Sum.txt` doesn't perform analysis on the input, so no match is found.
- `Name:Sum.txt` -> finds it. The configured analysis for the field transforms the value to lowercase, and then it matches the stored value
- `Name:Sum*` -> does NOT find it. I don't know exactly why, but I assume it's because the "to_lower" token filter is skipped due to the wildcard
For content search, it's more complex. Based on the following content:
```
one+one=two
four+four=eight
ñíôùë*+ç$%&@?
demonstration
```
There is a unicode tokenization going on (as configured). I don't know the details, but it seems the tokens contain just unicode letters.
- `one+one=two` will have the tokens `one`, `one` and `two` (no "+" nor "=")
- `ñíôùë*+ç$%&@?` will have the tokens `ñíôùë` and `ç`, so most of the symbols will be removed.
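As a rough illustration of that tokenization, here is a toy Go sketch that splits on anything that is not a unicode letter. The real bleve `unicode` tokenizer does proper text segmentation, so this is only an approximation of the behavior observed above:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize approximates the observed behavior: keep runs of unicode
// letters and drop symbols like '+', '=', '$'. This is a sketch, not
// the actual bleve tokenizer.
func tokenize(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
}

func main() {
	fmt.Println(tokenize("one+one=two"))   // [one one two]
	fmt.Println(tokenize("ñíôùë*+ç$%&@?")) // [ñíôùë ç]
}
```

This would explain why searching for `+` or `=` alone never matches: those characters simply produce no tokens at index time.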
In addition to that, there is also a stemming transformation (also as configured).
In the end, I guess the indexed data is stored as follows (only checking with one document):
```
$ bleve dictionary bleve Content
demonstr - 1
eight - 1
four - 1
on - 1
two - 1
ç - 1
ñíôùë - 1
```
Examples:
- `Content:demo` -> not found. It isn't in the dictionary
- `Content:demo*` -> matches `demonstr`
- `Content:Demo*` -> not found. As said above, it seems wildcards skip the analyzer, so it isn't converted to lowercase.
- `Content:demonstrator` -> matches `demonstr`. I guess it has the same stem.
- `Content:DEmonstrator` -> matches `demonstr`. It's converted to lowercase and then it has the same stem (as above)
- `Content:demonstrat` -> not found. The stem seems to be "demonstrat", which doesn't match anything
- `Content:\=` -> not found. It doesn't generate any token
- `Content:ç$` -> matches just the `ç` portion
I think that covers most of the questions. Note that this is just for bleve and has been checked using the bleve command line. There is a KQL layer on top, so some things might be different, especially with weird characters, since I don't know how they'll be converted from KQL to bleve syntax.
In addition, the web client is sending `(name:"*demo cont*" OR content:"demo cont") or (name:"*demo cont*" OR content:"demo cont")` as the query string to the server. I'm pretty sure the KQL-to-bleve converter at least changes the field names, because otherwise bleve wouldn't find a match.
I guess we want to have the same behavior, so we should send the same base query (except for what we're searching). However, I'd advise against prepending wildcards because it's considered to have bad performance (elasticsearch recommended against them, so I guess bleve will have the same problems)
Furthermore, in case data is stored weirdly, the current mappings can't be changed. If we ever really need to change anything in the mappings, we'll probably need to use the same approach that we did for OC10: creating a new index and moving all the data there.
Thanks a lot @jvillafanez for the clear explanation. I have just one question:
```
$ bleve dictionary bleve Content
demonstr - 1
```

Why `demonstr` and not the whole `demonstration` word in the index? Is it somehow assuming stems of a maximum of 8 chars? Or is it part of the KQL magic on top?
@felix-schwarz 👇 👇
> I guess we want to have the same behavior, so we should send the same base query (except for what we're searching). However, I'd advise against prepending wildcards because it's considered to have bad performance (elasticsearch recommended against them, so I guess bleve will have the same problems)
Wildcards should maybe be removed from requests, since the search engine is already taking care of the matches, in order not to suffer performance penalties.
> why demonstr and not the whole demonstration word in the index?
That's what the "stemmer_porter" does. It's part of the configured analysis done to the file content. The code links to https://github.com/blevesearch/go-porterstemmer/ and the algorithm is defined in https://tartarus.org/martin/PorterStemmer/def.txt. From our side, consider it "magic" to find related words: if the document contains "demonstration", you can find the document searching by "demonstrator", "demonstrate", etc. without using the exact word.
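To illustrate why all those words collapse to the same index term, here is a toy suffix-stripper in Go. It is NOT the Porter algorithm (Porter has measure conditions and many more rules, applied over several steps); the suffix list below is made up purely for this example:

```go
package main

import (
	"fmt"
	"strings"
)

// stem is a toy suffix-stripper that only illustrates the idea of
// stemming: "demonstration", "demonstrator" and "demonstrate" all
// collapse to the same stem "demonstr". The real index uses the
// Porter stemmer, which is far more sophisticated.
func stem(w string) string {
	for _, suf := range []string{"ation", "ator", "ate"} {
		// Require a minimal remaining stem length, so short words
		// are left alone.
		if strings.HasSuffix(w, suf) && len(w)-len(suf) >= 3 {
			return w[:len(w)-len(suf)]
		}
	}
	return w
}

func main() {
	fmt.Println(stem("demonstration")) // demonstr
	fmt.Println(stem("demonstrator"))  // demonstr
	fmt.Println(stem("demonstrate"))   // demonstr
}
```

So the 8-character "demonstr" in the dictionary is not a length limit at all; it is just the stem that all these related words share.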
@jesmrec is there anything concrete we can fix on this ticket?
Not for the moment @kobergj, and thanks @jvillafanez.