synapse icon indicating copy to clipboard operation
synapse copied to clipboard

Search for URLs in messages doesn't return any results

Open WGH- opened this issue 6 years ago • 16 comments

Description

Message text search doesn't work with URL. Even if you search for URL exactly as it was in a message, search will not return any results.

Steps to reproduce

  • Post some link in some room, e.g. https://www.youtube.com/watch?v=oHg5SJYRHA0
  • Enter the same link in the search box: https://www.youtube.com/watch?v=oHg5SJYRHA0
  • Press enter

The search results will not include the message with the link, which is unexpected.

Version information

  • Homeserver: self-hosted
  • Version: matrix-synapse==0.26.0
  • Install method: pip
  • Platform: Linux, PostgreSQL 9.6

WGH- avatar Mar 25 '18 14:03 WGH-

Plain SQL query returns the message like a charm:

synapse=> select event_id, content from events where content like '%https://www.youtube.com/watch?v=oHg5SJYRHA0%';
            event_id            |                                  content                                  
--------------------------------+---------------------------------------------------------------------------
 $15219863901020QNrdM:torlan.ru | {"body":"https://www.youtube.com/watch?v=oHg5SJYRHA0","msgtype":"m.text"}
(1 row)

WGH- avatar Mar 25 '18 14:03 WGH-

Just ran into this same issue. Tried searching a room for a specific link and it wasn't finding anything but I found it when searching manually.

skylord123 avatar Mar 29 '18 17:03 skylord123

Related to https://github.com/vector-im/riot-web/issues/4486 ?

werner291 avatar Sep 27 '19 16:09 werner291

Managed to reproduce in Riot on my homeserver. Will try upgrading server to see if the issue persists. (Though shouldn't be Riot-specific since the request itself seems to be failing.)

werner291 avatar Sep 27 '19 16:09 werner291

Also occurs on latest develop version when running live.

werner291 avatar Sep 27 '19 17:09 werner291

This bit looks suspicious: https://github.com/matrix-org/synapse/blob/master/synapse/storage/search.py#L398

ts_search appears to be doing some kind of stopword removal and such. I can imagine URI's might get mangled by that https://www.postgresql.org/docs/9.5/textsearch-controls.html

werner291 avatar Sep 27 '19 19:09 werner291

And the tests were passing since they are run against SQLite, which uses a somewhat more straightforward approach.

werner291 avatar Sep 27 '19 19:09 werner291

SELECT vector FROM event_search WHERE vector @@ to_tsquery('english', 'www.youtube.com');

Curiously, this does seem to produce a couple results, suggesting to_tsquery isn't the roadbloack I thought it might be.

werner291 avatar Sep 27 '19 19:09 werner291

Yep: PostgreSQL is rather iffy when it comes to handling URI's.

You end up with queries like WHERE vector @@ to_tsquery('english', 'youtube.com:*'); returning 0 results, whereas WHERE vector @@ to_tsquery('english', 'www.youtube.com:*') suddenly finds the event you're looking for, but then WHERE vector @@ to_tsquery('english', 'youtube:*') suddenly doesn't work again.

This is made worse by the fact that Synapse mangles a query string like "www.youtube.com" into something like www:* & youtube:* & com:* which prevents us from hitting that one case where we get a useful result.

werner291 avatar Sep 28 '19 00:09 werner291

Possibly relevant: https://stackoverflow.com/questions/35402243/postgres-full-text-search-ignore-url

werner291 avatar Sep 28 '19 11:09 werner291

How about we pre-process the search index like this?

SELECT translate('https://www.youtube.com/watch?v=dQw4w9WgXcQ', './:?=', ' ');

This results in a breakdown like this: https www youtube com watch vdQw4w9WgXcQ which should be easier to parse for the search engine.

SELECT to_tsvector('english', translate('https://www.youtube.com/watch?v=dQw4w9WgXcQ', './:?=', '  '));
 >  'com':4 'https':1 'vdqw4w9wgxcq':6 'watch':5 'www':2 'youtub':3

Much nicer parsing result.

werner291 avatar Sep 28 '19 12:09 werner291

Removing special characters would happen around here then, probably: https://github.com/matrix-org/synapse/blob/master/synapse/storage/search.py#L326

Note that we're essentially already doing that here: https://github.com/matrix-org/synapse/blob/master/synapse/storage/search.py#L695

werner291 avatar Sep 28 '19 12:09 werner291

How odd... When entering the URI in Riot search it doesn't seem to always work, yet it seems mostly reliable when running the tests. Yes, I'm running with SYNAPSE_POSTGRES=1

werner291 avatar Sep 28 '19 13:09 werner291

:man_facepalming: I forgot how federation worked, was running the search against a non-upgraded server.

On a more positive note, URI search seems to work very well!

werner291 avatar Sep 28 '19 13:09 werner291

Note that the improved search will only work on messages newly inserted into the search index. (Old messages will behave the same as before.)

It will also only work on PostgreSQL. SQlite seemed to already be sorta working before so i didn't touch it.

werner291 avatar Sep 28 '19 13:09 werner291

I've just had a quick look and it's still an issue. For what it's worth by looking at the search vector and testing locally a bit it looks like dropping the https:// bit from the search query allows the URL to be searched:

     key      |                                        vector                                        
--------------+--------------------------------------------------------------------------------------
 content.body | '/watch?v=ohg5sjyrha0':3 'www.youtube.com':2 'www.youtube.com/watch?v=ohg5sjyrha0':1

babolivier avatar Jul 21 '22 13:07 babolivier