pycsw icon indicating copy to clipboard operation
pycsw copied to clipboard

FTS on csw:AnyText is not reliable when using wildcards

Open ricardogsilva opened this issue 9 years ago • 2 comments

Description

When querying a postgresql repository with expressions that take the form csw:AnyText like %something%, postgresql's Full Text Search (FTS) should not be used because it leads to erroneous results. This happens because the FTS mechanism does not know how to interpret the wildcard character %.

However, not all queries lead to incorrect results. It actually depends on the content of the request. This happens because when postgresql processes an FTS query, it will apply some pre-processing to the search terms, including stemming and these may significantly alter the result of the matching.

Example:

Lets say we have the following phrase:

I am a robot learning to search text

Now we want to know if the substring 'bot' is in our phrase. Using normal SQL, we'd build a query like this:

SELECT 'I am a robot learning to search text' ILIKE '%bot%';

Which yields a result like:

?column?
--------------
t
(1 row)

The t meaning True, as in we have found a correct match.

Now if we try to use an FTS query things are very different. We'd issue a query like:

SELECT to_tsvector('I am a robot learning to search text') @@ to_tsquery('%bot%')

Now the result is:

?column?
--------------
f
(1 row)

The f meaning False, as in we did not find a match.

This happens because the FTS tools do not know how to interpret the pattern %bot%. No expansion is done on the token.

Lets see how postgresql treats our phrase:

SELECT to_tsvector('I am a robot learning to search text');

The result:

to_tsvector
--------------
'learn':5 'robot':4 'search':7 'text':8
(1 row)

The to_tsvector function stripped away some common words from the phrase and also performed stemming on the word learning, which now became learn.

Now lets see how postgresql handles the matching part of our query (the stuff after the @@ operator)

SELECT to_tsquery('%bot%');

The result:

to_tsquery
--------------
'bot'
(1 row)

The to_tsquery function strips away our % wildcards and just returns the substring. It also performs stemming, although that is not visible here because the word bot is already a stem.

This means that:

  1. An FTS query will not fulfill our hope that the database would expand '%bot%' into 'robot' and find a match. I'm thinking that we should just disable FTS in csw:AnyText type of constraints whenever the constraint includes the wildcard character. What is your opinion?

  2. It also implies that some queries may also return more results than desirable, if the original expressions are stemmed into more generic strings. I haven't thought of a nice solution to this problem yet... Maybe this is not a problem at all, and perhaps the user can iteratively select more appropriate search terms. What is your opinion?

In pycsw we are using FTS when the following conditions are met:

  • The underlying repository is of type postgresql
  • There is a suitable index in the database to allow fast(er) FTS processing
  • The property requested in the constraint is csw:AnyText

Environment

  • operating system:
  • Python version:
  • pycsw version:
  • source/distribution
  • [x] git clone
  • [ ] DebianGIS/UbuntuGIS
  • [ ] PyPI
  • [ ] zip/tar.gz
  • [ ] other (please specify):
  • web server
  • [x] Apache/mod_wsgi
  • [ ] CGI
  • [ ] other (please specify):

Steps to Reproduce

  • Fire up a postgresql database
  • Load it with the test data from the sru suite
  • Perform the second test on the requests.txt file of the sru suite. The test is something like
search,mode=sru&version=1.1&operation=searchRetrieve&query=lor

The test will transform the query=lor KVP parameter into csw:AnyText like '%lor%'. It will then query the database using FTS and will come up with zero results. The test will fail because it expects 5 records as a result. This test will not fail when using sqlite as a repository because there it will not use FTS and stick with the boring old SQL.

Additional Information

  • This problem was uncovered while working on RFC8
  • Reproducing this problem is perhaps not so straightforward because our tests are not currently working with a postgresql repository. Running paver test -d PostgreSQL throws a code error.
  • Seeing as postgresql is a more capable RDBMS for production deployments, we should probably adapt our current CI pipeline to use it instead of sqlite. I am paving the way for this with the work from RFC8, which will allow selecting between sqlite and postgresql when running tests. After this we could explore ways to use postgresql in Travis instead of sqlite.

ricardogsilva avatar Dec 08 '16 21:12 ricardogsilva

To clarify the issue with a more realistic use case, the mentioned test in the sru suite fails because the query is transformed to csw:AnyText like '%lor%'. The test expects records that contain the word lorem to be found by the query but since postgresql repositories will use FTS, the search term will be passed to the to_tsquery() function and the FTS searching does not expand the % wildcard.

This means that FTS will fail to find records that have the word lorem because it does not use the correct search term.

A consequence of this fact is that some our tests (which all pass on sqlite) are indeed failing with postgresql and this has been slipping by unnoticed due to lack of using postgresql in the CI pipeline.

ricardogsilva avatar Dec 09 '16 23:12 ricardogsilva

I had a problem with FTS in my anytext filter as well. I need to filter 4 words in sequence, but not when they show up alone in the rest of anytext. In FTS option, the singlechar param is ignored which won't let me force it to be in sequence.

maurelioc avatar Aug 17 '21 14:08 maurelioc