tasking-manager icon indicating copy to clipboard operation
tasking-manager copied to clipboard

Project Search Full-Text Partial Word Auto-complete

Open thadk opened this issue 5 years ago • 2 comments

The new TM 4.0 design calls for an auto-complete project search field on the "Explore Projects" page. After research, this is commonly implemented using an Elastic Search instance which is not part of TM. A few people have tried to implement it using only PostgreSQL but there are manual steps to increase its usefulness.

image

The PostgreSQL GIN index, which is already set up for the project search, likely needs a new API created to enable the autocomplete in the design spec as it is shown. But there are some choices.

There are two things the autocomplete API could return using PostgreSQL:

  1. query suggestions: suggested words like "Malaria" or "Mozambique" as show in the design mockup. Query suggestions are made up of ngrams. Ngrams are are small sets of words together.
  2. actual projects results: actual project titles matching some of the words or partial word typed so far

There is a third option. It is mostly client-side and technically simpler. We could offer autocomplete for all titles of projects in the active search. We could add the project titles to the project/search API for all pages of the results, along with the id and map location. For 3000 projects, this should be easy to handle on the client side. We could still allow the user to ignore the autocomplete and use the full-text search too.

The fourth option is to leave the text search box as-is and/or to just make the existing project/search search support partial word prefix matching. It could have no new autocomplete popup since it already live-updates shown project cards on typing. It would be a bit slower.

Currently when the project/search is given a text query, the API only responds with projects that contain the matching whole words from the index. Instead, this new API should respond given incomplete words with ngrams or projects the GIN full text search index.

“prefix matching” using a syntax such as to_tsquery('humanit:*') as below for the test & ar... search. It can allow us to do an example search of "ssa ar" to find the results with "SSA" or "arbitrary" in my test database. (note: the main project/search currently replaces spaces with | operator): SELECT * FROM project_info WHERE text_searchable @@ (to_tsquery('ssa | ar:*')) = true; image

For 1) my research indicates we would need to build an ngram table from the document text which we would return the query suggestions from, based on frequency of the ngrams. We might update it on inserts. I haven't seen any examples of re-using the tsvector data we have in project_info.text_searchable. Here is a in-depth guide with pseudocode: https://towardsdatascience.com/implementing-auto-complete-with-postgres-and-python-e03d34824079

For 2) I think we can mostly use the code we have and just simplify the query from project/search to make it faster with a new endpoint and add the prefix matching feature above. I don't think it would be as useful as query suggestions. This elastic search partial reimplementation might help if the simple prefix matching doesn't work: https://stackoverflow.com/questions/56894979/edge-ngram-search-in-postgresql

I think the main thing is that we not over-engineer this. Based on the research, it could be a time sink to implement this. With only about 3,000 projects, there is not really much text to search. A basic solution probably solves it.

Other resources

  • very simple postgresql text search
  • https://www.postgresql.org/docs/11/textsearch-controls.html
  • https://www.postgresql.org/docs/9.6/pgtrgm.html
  • https://github.com/wagtail/wagtail/issues/3709 and https://gist.github.com/BertrandBordage/4aee04f6931e4fa8c2c351eb7eeb8e31
  • https://czep.net/17/full-text-search.html
  • select to_tsvector('simple', 'Yorick Peterse') @@ to_tsquery('simple', 'yor:*') --true: the "simple" turns off English specific features but you cannot mix dictionaries.
  • https://postgrespro.com/list mailing list search query suggestions do use PostgreSQLpg_trgm according to this document linked by CrunchyData/Paul Ramsey at FOSS4G-NA though no source obvious.

thadk avatar Sep 02 '19 04:09 thadk

closed as per #1859

bopercival-hot avatar Aug 16 '21 18:08 bopercival-hot

As #1859 was not merged this issue is still open. There are too many conflicts on #1859 to merge so need some rework on this.

Aadesh-Baral avatar Jun 21 '22 05:06 Aadesh-Baral