tasking-manager
tasking-manager copied to clipboard
Project Search Full-Text Partial Word Auto-complete
The new TM 4.0 design calls for an auto-complete project search field on the "Explore Projects" page. After research, this is commonly implemented using an Elastic Search instance which is not part of TM. A few people have tried to implement it using only PostgreSQL but there are manual steps to increase its usefulness.
data:image/s3,"s3://crabby-images/a057b/a057b9dc6dda2aae491aa7f3325d9a51585bc364" alt="image"
The PostgreSQL GIN index, which is already set up for the project search, likely needs a new API created to enable the autocomplete in the design spec as it is shown. But there are some choices.
There are two things the autocomplete API could return using PostgreSQL:
- query suggestions: suggested words like "Malaria" or "Mozambique" as show in the design mockup. Query suggestions are made up of ngrams. Ngrams are are small sets of words together.
- actual projects results: actual project titles matching some of the words or partial word typed so far
There is a third option. It is mostly client-side and technically simpler. We could offer autocomplete for all titles of projects in the active search. We could add the project titles to the project/search
API for all pages of the results, along with the id
and map location. For 3000 projects, this should be easy to handle on the client side. We could still allow the user to ignore the autocomplete and use the full-text search too.
The fourth option is to leave the text search box as-is and/or to just make the existing project/search
search support partial word prefix matching. It could have no new autocomplete popup since it already live-updates shown project cards on typing. It would be a bit slower.
Currently when the project/search
is given a text query, the API only responds with projects that contain the matching whole words from the index. Instead, this new API should respond given incomplete words with ngrams or projects the GIN full text search index.
“prefix matching” using a syntax such as to_tsquery('humanit:*')
as below for the test & ar...
search. It can allow us to do an example search of "ssa ar
" to find the results with "SSA" or "arbitrary" in my test database.
(note: the main project/search
currently replaces spaces with |
operator):
SELECT * FROM project_info WHERE text_searchable @@ (to_tsquery('ssa | ar:*')) = true;
For 1) my research indicates we would need to build an ngram table from the document text which we would return the query suggestions from, based on frequency of the ngrams. We might update it on inserts. I haven't seen any examples of re-using the tsvector
data we have in project_info.text_searchable
. Here is a in-depth guide with pseudocode: https://towardsdatascience.com/implementing-auto-complete-with-postgres-and-python-e03d34824079
For 2) I think we can mostly use the code we have and just simplify the query from project/search
to make it faster with a new endpoint and add the prefix matching feature above. I don't think it would be as useful as query suggestions. This elastic search partial reimplementation might help if the simple prefix matching doesn't work: https://stackoverflow.com/questions/56894979/edge-ngram-search-in-postgresql
I think the main thing is that we not over-engineer this. Based on the research, it could be a time sink to implement this. With only about 3,000 projects, there is not really much text to search. A basic solution probably solves it.
Other resources
- very simple postgresql text search
- https://www.postgresql.org/docs/11/textsearch-controls.html
- https://www.postgresql.org/docs/9.6/pgtrgm.html
- https://github.com/wagtail/wagtail/issues/3709 and https://gist.github.com/BertrandBordage/4aee04f6931e4fa8c2c351eb7eeb8e31
- https://czep.net/17/full-text-search.html
-
select to_tsvector('simple', 'Yorick Peterse') @@ to_tsquery('simple', 'yor:*') --true
: the "simple" turns off English specific features but you cannot mix dictionaries. - https://postgrespro.com/list mailing list search query suggestions do use PostgreSQL
pg_trgm
according to this document linked by CrunchyData/Paul Ramsey at FOSS4G-NA though no source obvious.
closed as per #1859
As #1859 was not merged this issue is still open. There are too many conflicts on #1859 to merge so need some rework on this.