synapse
synapse copied to clipboard
Search for URLs in messages doesn't return any results
Description
Message text search doesn't work with URL. Even if you search for URL exactly as it was in a message, search will not return any results.
Steps to reproduce
- Post some link in some room, e.g. https://www.youtube.com/watch?v=oHg5SJYRHA0
- Enter the same link in the search box: https://www.youtube.com/watch?v=oHg5SJYRHA0
- Press enter
The search results will not include the message with the link, which is unexpected.
Version information
- Homeserver: self-hosted
- Version: matrix-synapse==0.26.0
- Install method: pip
- Platform: Linux, PostgreSQL 9.6
Plain SQL query returns the message like a charm:
synapse=> select event_id, content from events where content like '%https://www.youtube.com/watch?v=oHg5SJYRHA0%';
event_id | content
--------------------------------+---------------------------------------------------------------------------
$15219863901020QNrdM:torlan.ru | {"body":"https://www.youtube.com/watch?v=oHg5SJYRHA0","msgtype":"m.text"}
(1 row)
Just ran into this same issue. Tried searching a room for a specific link and it wasn't finding anything but I found it when searching manually.
Related to https://github.com/vector-im/riot-web/issues/4486 ?
Managed to reproduce in Riot on my homeserver. Will try upgrading server to see if the issue persists. (Though shouldn't be Riot-specific since the request itself seems to be failing.)
Also occurs on latest develop version when running live.
This bit looks suspicious: https://github.com/matrix-org/synapse/blob/master/synapse/storage/search.py#L398
ts_search appears to be doing some kind of stopword removal and such. I can imagine URI's might get mangled by that https://www.postgresql.org/docs/9.5/textsearch-controls.html
And the tests were passing since they are run against SQLite, which uses a somewhat more straightforward approach.
SELECT vector FROM event_search WHERE vector @@ to_tsquery('english', 'www.youtube.com');
Curiously, this does seem to produce a couple results, suggesting to_tsquery isn't the roadbloack I thought it might be.
Yep: PostgreSQL is rather iffy when it comes to handling URI's.
You end up with queries like WHERE vector @@ to_tsquery('english', 'youtube.com:*');
returning 0 results, whereas WHERE vector @@ to_tsquery('english', 'www.youtube.com:*')
suddenly finds the event you're looking for, but then WHERE vector @@ to_tsquery('english', 'youtube:*')
suddenly doesn't work again.
This is made worse by the fact that Synapse mangles a query string like "www.youtube.com" into something like www:* & youtube:* & com:*
which prevents us from hitting that one case where we get a useful result.
Possibly relevant: https://stackoverflow.com/questions/35402243/postgres-full-text-search-ignore-url
How about we pre-process the search index like this?
SELECT translate('https://www.youtube.com/watch?v=dQw4w9WgXcQ', './:?=', ' ');
This results in a breakdown like this: https www youtube com watch vdQw4w9WgXcQ
which should be easier to parse for the search engine.
SELECT to_tsvector('english', translate('https://www.youtube.com/watch?v=dQw4w9WgXcQ', './:?=', ' '));
> 'com':4 'https':1 'vdqw4w9wgxcq':6 'watch':5 'www':2 'youtub':3
Much nicer parsing result.
Removing special characters would happen around here then, probably: https://github.com/matrix-org/synapse/blob/master/synapse/storage/search.py#L326
Note that we're essentially already doing that here: https://github.com/matrix-org/synapse/blob/master/synapse/storage/search.py#L695
How odd... When entering the URI in Riot search it doesn't seem to always work, yet it seems mostly reliable when running the tests. Yes, I'm running with SYNAPSE_POSTGRES=1
:man_facepalming: I forgot how federation worked, was running the search against a non-upgraded server.
On a more positive note, URI search seems to work very well!
Note that the improved search will only work on messages newly inserted into the search index. (Old messages will behave the same as before.)
It will also only work on PostgreSQL. SQlite seemed to already be sorta working before so i didn't touch it.
I've just had a quick look and it's still an issue. For what it's worth by looking at the search vector and testing locally a bit it looks like dropping the https://
bit from the search query allows the URL to be searched:
key | vector
--------------+--------------------------------------------------------------------------------------
content.body | '/watch?v=ohg5sjyrha0':3 'www.youtube.com':2 'www.youtube.com/watch?v=ohg5sjyrha0':1