Special characters in searching feature with Tika

Open jesmrec opened this issue 10 months ago • 15 comments

Describe the bug

When a file contains special characters such as ñíôùë*+ç$%&@? and the iOS client performs a server-side search over the content using one of those characters, the server returns an empty or incomplete result. Some results also seem to be incorrect.

Steps to reproduce

Create a txt file called sum.txt with the following content:

one+one=two four+four=eight

Using the iOS client, search by content:

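| Typed string in client | Returned | Passed? | Comment |
| --- | --- | --- | --- |
| one | no result | ❌ | one exists twice |
| four | sum.txt | ✅ | |
| + | no result | ❌ | + exists |
| = | no result | ❌ | = exists |
| on | sum.txt | ✅ | |
| one+one | no result | ❌ | one+one fits the content |
| four+four | no result | ❌ | four+four fits the content |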

There are more examples with the other characters mentioned above.

This is the curl you can use to reproduce:

curl -X REPORT https://xx.xx.xx.xx:9200/remote.php/dav/spaces \
  -H 'Host: xx.xx.xx.xx:9200' \
  -H 'Original-Request-ID: B88FCC21-5664-4706-96A3-C3C2B314770B' --compressed \
  -H 'Connection: keep-alive' \
  -H 'User-Agent: ownCloudApp/12.4.0 (App/296; iOS/18.2; iPhone)' \
  -H 'Accept-Language: en' \
  -H 'Authorization: Bearer ...' \
-d '<?xml version="1.0" encoding="UTF-8"?>
<oc:search-files xmlns:D="DAV:" xmlns:oc="http://owncloud.org/ns">
  <D:prop>
  <D:resourcetype/>
  <D:getlastmodified/>
  <D:getcontentlength/>
  <D:getcontenttype/>
  <D:getetag/>
  <oc:id/>
  <oc:size/>
  <oc:permissions/>
  <oc:favorite/>
  <oc:share-types/>
  <oc:owner-id/>
  <oc:owner-display-name/>
  </D:prop>
  <oc:search>
    <oc:pattern>(content:&quot;*one*&quot;)</oc:pattern>
    <oc:limit>100</oc:limit>
  </oc:search>
</oc:search-files>'

Replace *one* in the oc:pattern element to get different results.
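For anyone scripting this instead of editing the curl by hand, here is a minimal Python sketch that builds the REPORT body and XML-escapes the pattern the same way the request above does. The helper name `build_search_body` and the trimmed `D:prop` list are assumptions made for illustration; they are not taken from any client.

```python
from xml.sax.saxutils import escape


def build_search_body(pattern: str, limit: int = 100) -> str:
    """Return a minimal oc:search-files REPORT body for a KQL pattern."""
    # escape() handles &, < and >; the extra mapping also turns " into &quot;
    escaped = escape(pattern, {'"': "&quot;"})
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<oc:search-files xmlns:D="DAV:" xmlns:oc="http://owncloud.org/ns">\n'
        "  <D:prop><oc:id/><D:getetag/></D:prop>\n"  # trimmed property list
        "  <oc:search>\n"
        f"    <oc:pattern>{escaped}</oc:pattern>\n"
        f"    <oc:limit>{limit}</oc:limit>\n"
        "  </oc:search>\n"
        "</oc:search-files>"
    )


body = build_search_body('(content:"*one+one*")')
print(body)
```

The resulting string can be passed to curl with `-d`, or to any HTTP client that supports the REPORT method.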

Setup

Created a docker container with server side search by using the following docker-compose.yml file:

version: "3.7"

services:
  ocis:
    image: owncloud/ocis:7.0.1
    ports:
      - "9200:9200"
      - "9215:9215"
    environment:
      OCIS_INSECURE: "true"
      OCIS_URL: "https://IP:9200"
      IDM_CREATE_DEMO_USERS: "true"
      IDM_ADMIN_PASSWORD: "admin"
      PROXY_ENABLE_BASIC_AUTH: "true"
      OCIS_SERVICE_ACCOUNT_ID: "b0fbfad7-3dd6-49cb-b468-3f588f2f82be"
      OCIS_SERVICE_ACCOUNT_SECRET: "asaGE4DF"
      SEARCH_EXTRACTOR_TYPE: tika
      SEARCH_EXTRACTOR_TIKA_TIKA_URL: "http://tika:9998"
      FRONTEND_FULL_TEXT_SEARCH_ENABLED: "true"
    restart: "no"
    entrypoint: ["/bin/sh"]
    command: ["-c", "ocis init || true; ocis server"]
    networks:
      - ocis-net

  tika:
    image: apache/tika:2.9.0.0-full
    restart: "always"
    networks:
      - ocis-net

networks:
  ocis-net:


jesmrec avatar Feb 25 '25 13:02 jesmrec

OTOH, using a date for a more accurate search does not work either:

<oc:search>
    <oc:pattern>((mtime:&gt;2024-12-31T23:00:00+00:00) AND (name:&quot;*Joe*&quot;))</oc:pattern>
    <oc:limit>100</oc:limit>
</oc:search>

returns an empty set of values.

jesmrec avatar Feb 26 '25 13:02 jesmrec

> OTOH, using a date for a more accurate search, does not work either
>
> <oc:search>
>     <oc:pattern>((mtime:&gt;2024-12-31T23:00:00+00:00) AND (name:&quot;*Joe*&quot;))</oc:pattern>
>     <oc:limit>100</oc:limit>
> </oc:search>
>
> returns an empty set of values.

A colon : is not a delimiter between the property and the value; it is an operator. Combining : with > < = doesn't make sense. See https://learn.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference#property-operators-that-are-supported-in-property-restrictions

2403905 avatar Mar 10 '25 21:03 2403905

cc @felix-schwarz 👆

jesmrec avatar Mar 11 '25 07:03 jesmrec

We use KQL on top of Bleve. KQL supports prefix matching with phrases specified in property values, but you must use the wildcard operator * in the query, and it is supported only at the end of the phrase.

  • searching by + or = returns No matches
  • one+one, one one, one, on* work fine.
  • one+*, *one* don't work. I'm assuming that the text one+one is indexed as 2 separate terms in the index and that the wildcard does not work in this case. https://github.com/blevesearch/bleve/issues/1433#issuecomment-667349811
Image

2403905 avatar Mar 11 '25 18:03 2403905

@2403905

> A colon : is not a delimiter between the property and the value; it is an operator. Combining : with > < = doesn't make sense. See https://learn.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference#property-operators-that-are-supported-in-property-restrictions

IIRC back when I implemented the feature, I used the KQL reference at first but didn't get any results back from the server. (Also IIRC) it started working once I was using :< and :> instead of < & >.

I'd love to verify what I remember and either share request/response pairs - or change the respective code, but I lack a test instance (demo.owncloud.com has broken search, ocis.*.master.owncloud.works are all offline for a while now).

felix-schwarz avatar Mar 12 '25 11:03 felix-schwarz

Update: Using https://owncloud.dev/ocis/deployment/ocis_full/ I set up a test instance, which appears to have working Tika integration. I'll post an update once I'm through with testing…

felix-schwarz avatar Mar 12 '25 11:03 felix-schwarz

I now had a chance to test with the local test instance and check if the generated KQL syntax works with the current server version. I found that

  • it really needed to be :< / :> to work back when I implemented it
  • it now works only with the KQL-correct < and >; I fixed that with two PRs, in the SDK and the app respectively
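A minimal sketch of how a client could assemble the date-plus-name pattern in the form reported to work above, with the comparison operator placed directly after the property name. The function name and parameters are hypothetical and only illustrate the string shape; the actual SDK code may differ.

```python
def date_and_name_pattern(since_iso: str, name_fragment: str) -> str:
    """Build a KQL pattern: modified after `since_iso` AND name contains fragment."""
    # KQL-correct comparison: "mtime>" and not "mtime:>"
    date_part = f"(mtime>{since_iso})"
    # ":" stays for the name restriction, where it is the operator itself
    name_part = f'(name:"*{name_fragment}*")'
    return f"({date_part} AND {name_part})"


print(date_and_name_pattern("2024-12-31T23:00:00+00:00", "Joe"))
# ((mtime>2024-12-31T23:00:00+00:00) AND (name:"*Joe*"))
```

The pattern still has to be XML-escaped (`>` becomes `&gt;`, `"` becomes `&quot;`) before it goes into oc:pattern.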

felix-schwarz avatar Mar 12 '25 11:03 felix-schwarz

It would be good to add better query validation messages on the ocis side.

2403905 avatar Mar 12 '25 12:03 2403905

About the described problems:

  • Date comparison and filtering is OK after the fix from @felix-schwarz

  • About the prefix matching:

> search by + or = returns No matches

Is there a list somewhere of banned characters (or similar) that can only be matched via wildcard?

jesmrec avatar Mar 12 '25 12:03 jesmrec

Here is the Bleve doc: https://blevesearch.com/docs/Query-String-Query/ We escape the characters +-=&|><!(){}[]^\"~*?:\\/ and the search works fine for contains\ a\"\ character, but for some reason it doesn't work for 'Content:\=\=' or 'Content:\+\+' in the sample below:

one+one=two
four+four=eight
++
==dsdss five
contains a" character

It looks like we should avoid using +_*. but it is just an assumption.
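To make the escaping concrete, here is a small Python sketch of backslash-escaping the characters listed above. It is not the ocis implementation (that one is Go); the function name is made up, and treating the space as a character to escape is an assumption read off the `contains\ a\"\ character` example.

```python
# Characters from the list above, plus space (assumed from the example query).
SPECIAL = set('+-=&|><!(){}[]^"~*?:\\/ ')


def escape_query_value(value: str) -> str:
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in value)


print(escape_query_value('contains a" character'))  # contains\ a\"\ character
print(escape_query_value("=="))                     # \=\=
print(escape_query_value("++"))                     # \+\+
```

Escaping only keeps the query parser from interpreting these characters as operators; it does not change which tokens ended up in the index.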

Searching for 'Content:ñíôù*' also works fine in the sample ñíôùë*+ç$%&@? char ùësd, but 'Content:ç$%&@', 'Content:ç\$\%\&\@' or 'Content:ç\$\%' return only one match.

Image

I don't see any explanation of how the index mapping and analyzers work: https://blevesearch.com/docs/Index-Mapping/

2403905 avatar Mar 12 '25 16:03 2403905

So, it's not totally clear how it works under the hood. I've checked with different special characters and it works fine with no escaping. For a file with the following short content:

ñíôùë*+ç$%&@?

Searches:

  • ñ -> ✅
  • ñí -> ✅
  • ñíô -> ✅
  • ñíôù -> ✅
  • ñíôùë -> ✅
  • ñíôùë* -> ✅
  • ñíôùë*+ -> ❌ (because of the +, as stated)
  • ç -> ✅
  • ç$ -> ❌
  • @ -> ❌

(these are the typed strings in the iOS app, the request to the server encloses them between wildcards)

The fact is, I found no characters that are escaped in the request; I'm not sure whether this affects anything.

OTOH, maybe it's something related to my test server, but it only searches up to 8 characters. The word demonstration is in one of my files.

Searching by:

  • d, de, ... demonstr (8 chars): ✅, the file is listed
  • demonstra (and longer, like demonstration): ❌, not listed

I found nothing in the docs about a limit on the number of characters to search by.

Thanks for your support.

jesmrec avatar Mar 13 '25 09:03 jesmrec

@jesmrec Sorry if I confused you. You shouldn't care about escaping, the backend is doing it. https://github.com/owncloud/ocis/blob/c2736dfb471619f517ebb6759ebf9502d9340040/services/search/pkg/query/bleve/compiler.go#L31

I used the Bleve CLI search responses in the screenshots to show that the problem is not on the search pattern side, because in the current implementation we parse the pattern into a KQL AST and then build the Bleve request.

2403905 avatar Mar 13 '25 15:03 2403905

I'm not sure why we have a "KQL->bleve" translator which is a big point of failure when just bleve (and by extension any search engine such as elasticsearch) is very complex to understand and to debug...

Having said that, I'll focus my explanation on bleve, using the bleve command line, so I'll skip any KQL to bleve translation being made (if any)


The default mapping configuration for the bleve index is below. Mapping configuration is in https://github.com/owncloud/ocis/blob/master/services/search/pkg/engine/bleve.go#L73 which results in the following:

{
  "default_mapping": {
    "enabled": true,
    "dynamic": true,
    "properties": {
      "Content": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "text",
            "analyzer": "fulltext",
            "store": true,
            "index": true,
            "include_term_vectors": true,
            "docvalues": true
          }
        ]
      },
      "Name": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "text",
            "analyzer": "lowercaseKeyword",
            "store": true,
            "index": true,
            "include_term_vectors": true,
            "include_in_all": true,
            "docvalues": true
          }
        ]
      },
      "Tags": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "text",
            "analyzer": "lowercaseKeyword",
            "store": true,
            "index": true,
            "include_term_vectors": true,
            "docvalues": true
          }
        ]
      }
    }
  },
  "type_field": "_type",
  "default_type": "_default",
  "default_analyzer": "keyword",
  "default_datetime_parser": "dateTimeOptional",
  "default_field": "_all",
  "store_dynamic": true,
  "index_dynamic": true,
  "docvalues_dynamic": true,
  "analysis": {
    "analyzers": {
      "fulltext": {
        "token_filters": [
          "to_lower",
          "stemmer_porter"
        ],
        "tokenizer": "unicode",
        "type": "custom"
      },
      "lowercaseKeyword": {
        "token_filters": [
          "to_lower"
        ],
        "tokenizer": "single",
        "type": "custom"
      }
    }
  }
}

If no field is provided in the search, the "_all" field will be used (basically a "_all:searchTerm" search), which uses a keyword analyzer. The "keyword" analyzer stores the input without changes. This is used for exact matches. Wildcards can be used.

For example, assuming you upload a file called "sum.txt":

  • sum.txt -> finds it
  • sum* -> finds it
  • Sum* -> does NOT find it. This is equivalent to _all:Sum*, and there is no lowercase conversion for the "_all" field
  • Sum.txt -> does NOT find it. _all:Sum.txt doesn't perform analysis on the input, so no match is found.
  • Name:Sum.txt -> finds it. The analysis configured for the field transforms the value to lowercase, and then it matches the stored value
  • Name:Sum* -> does NOT find it. I don't know exactly why, but I assume it's because the "to_lower" token filter is skipped due to the wildcard
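The six cases above can be reproduced with a toy model in Python. This is not Bleve code: it only encodes the behavior described in this comment, namely that `_all` queries are not analyzed, `Name` queries are lowercased, and wildcard terms skip analysis altogether (the last point is an assumption, as noted above).

```python
from fnmatch import fnmatchcase

# Term stored for the uploaded file after the lowercaseKeyword analyzer ran.
STORED_NAME = "sum.txt"


def matches(query: str) -> bool:
    """Toy lookup for 'term' or 'Field:term' against the single stored term."""
    field, _, term = query.rpartition(":")
    field = field or "_all"  # no field given: falls back to the "_all" field
    if "*" in term:
        # Assumption: wildcard terms are compared as typed, without analysis.
        return fnmatchcase(STORED_NAME, term)
    if field == "Name":
        term = term.lower()  # lowercaseKeyword: single token + to_lower
    # "_all" uses the keyword analyzer, so the term is left unchanged.
    return term == STORED_NAME


for q in ["sum.txt", "sum*", "Sum*", "Sum.txt", "Name:Sum.txt", "Name:Sum*"]:
    print(f"{q:14} -> {'found' if matches(q) else 'not found'}")
```

Running it prints found, found, not found, not found, found, not found, in the same order as the list above.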

For content search, it's more complex. Based on the following content:

one+one=two
four+four=eight

ñíôùë*+ç$%&@?
demonstration

There is a Unicode tokenization going on (as configured). I don't know the details, but it seems the tokens contain just Unicode letters.

  • one+one=two will have the tokens one,one and two (no "+" nor "=")
  • ñíôùë*+ç$%&@? will have the tokens ñíôùë and ç, so most of the symbols will be removed.
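A rough Python approximation of that tokenization step, for illustration only. Bleve's "unicode" tokenizer follows Unicode word segmentation rules, which this regex does not implement; keeping runs of letters and digits and lowercasing them is an assumption that happens to give the same tokens for the sample content. Stemming is left out.

```python
import re
import unicodedata


def rough_tokens(text: str) -> list[str]:
    """Split text into lowercase runs of letters/digits, dropping symbols."""
    # Normalize first so accented letters are single code points.
    text = unicodedata.normalize("NFC", text)
    # [^\W_]+ matches Unicode letters and digits, but not "_" or symbols.
    return [tok.lower() for tok in re.findall(r"[^\W_]+", text)]


print(rough_tokens("one+one=two"))      # ['one', 'one', 'two']
print(rough_tokens("four+four=eight"))  # ['four', 'four', 'eight']
print(rough_tokens("ñíôùë*+ç$%&@?"))
```

Since +, =, $, %, & and @ never become tokens, a query for just one of those symbols has nothing in the index to match against.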

In addition to that, there is also a stemming transformation (also as configured).

In the end, I guess the indexed data is stored as follows (only checking with one document):

$ bleve dictionary bleve Content
demonstr - 1
eight - 1
four - 1
on - 1
two - 1
ç - 1
ñíôùë - 1

Examples:

  • Content:demo -> not found. It isn't in the dictionary
  • Content:demo* -> matches demonstr
  • Content:Demo* -> not found. As said above, it seems wildcards skip the analyzer, so it isn't converted to lowercase.
  • Content:demonstrator -> matches demonstr. I guess it has the same stem.
  • Content:DEmonstrator -> matches demonstr. It's converted to lowercase and then it has the same stem (as above)
  • Content:demonstrat -> not found. The stem seems to be "demonstrat" which doesn't match anything
  • Content:\= -> not found. It doesn't generate any token
  • Content:ç$ -> matches just the ç portion
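The dictionary above also accounts for the results the iOS client got, since it wraps the typed string in wildcards. A small Python sketch, assuming (as above) that wildcard terms are compared directly against the indexed terms without analysis; the matching here uses `fnmatchcase` as a stand-in and is not Bleve's wildcard implementation.

```python
from fnmatch import fnmatchcase

# Indexed terms for the Content field, copied from the dictionary listing.
DICTIONARY = ["demonstr", "eight", "four", "on", "two", "ç", "ñíôùë"]


def wildcard_hits(pattern: str) -> list[str]:
    """Return the indexed terms that match a case-sensitive wildcard pattern."""
    return [term for term in DICTIONARY if fnmatchcase(term, pattern)]


# The iOS app sends the typed string enclosed in wildcards: *typed*
for typed in ["one", "four", "on", "demonstr", "demonstra", "Demo"]:
    print(f"*{typed}* -> {wildcard_hits(f'*{typed}*')}")
```

With this model, `*one*` and `*demonstra*` return nothing because the index only holds the stems on and demonstr, while `*on*`, `*four*` and `*demonstr*` do match. That would explain both the table in the issue description and the apparent 8-character limit.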

I think that covers most of the questions. Note that this is just for bleve and has been checked using the bleve command line. There is a KQL layer on top, so some things might be different, especially with unusual characters, since I don't know how they'll be converted from KQL to bleve syntax.

In addition, the web client is sending (name:&quot;*demo cont*&quot; OR content:&quot;demo cont&quot;) or (name:"*demo cont*" OR content:"demo cont") as query string to the server. I'm pretty sure the KQL to bleve converter at least changes the field names because otherwise bleve wouldn't find a match. I guess we want to have the same behavior, so we should send the same base query (except for what we're searching). However, I'd advise against prepending wildcards because it's considered to have bad performance (elasticsearch recommended against them, so I guess bleve will have the same problems)

Furthermore, in case data is stored weirdly, current mappings can't be changed. If we ever really need to change anything in the mappings, we'll probably need to use the same approach that we did for OC10: creating a new index and moving all the data there.

jvillafanez avatar Mar 20 '25 14:03 jvillafanez

Thanks a lot @jvillafanez for the clear explanation. I have just one question:

> $ bleve dictionary bleve Content
> demonstr - 1

Why demonstr and not the whole word demonstration in the index? Is it somehow assuming stems of at most 8 characters? Or is it part of the KQL magic on top?

@felix-schwarz 👇 👇

> I guess we want to have the same behavior, so we should send the same base query (except for what we're searching). However, I'd advise against prepending wildcards because it's considered to have bad performance (elasticsearch recommended against them, so I guess bleve will have the same problems)

Wildcards should maybe be removed from the requests, since the search engine already takes care of the matches, to avoid performance penalties.

jesmrec avatar Mar 26 '25 10:03 jesmrec

> why demonstr and not the whole demonstration word in the index?

That's what the "stemmer_porter" does. It's part of the configured analysis done to the file content. The code is linked to https://github.com/blevesearch/go-porterstemmer/ and the algorithm is defined in https://tartarus.org/martin/PorterStemmer/def.txt From our side, consider it "magic" to find related words, so if the document contains "demonstration" you can find the document searching by "demonstrator", "demonstrate", etc. without using the exact word.

jvillafanez avatar Mar 26 '25 17:03 jvillafanez

@jesmrec is there anything concrete we can fix on this ticket?

kobergj avatar Apr 07 '25 08:04 kobergj

Not ftm @kobergj, and thanks @jvillafanez

jesmrec avatar Apr 14 '25 16:04 jesmrec