
Search should support CJK

Open YuJianghao opened this issue 3 years ago • 28 comments

When searching for a CJK word, only words at the very beginning of a sentence are found.

To Reproduce (steps to reproduce the behavior):

  1. Create a document
  2. Add content 这是一个测试文档
  3. Save and publish
  4. Search for 测试
  5. The document isn't in the search result

Expected behavior: the document containing 这是一个测试文档 should appear in the search results.

Outline (please complete the following information):

  • Install: self hosted
  • Version: docker: outlinewiki/outline:0.62.0
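
One plausible reading of the reproduction steps, offered as an assumption rather than something confirmed here, is that an unbroken run of CJK characters is indexed as a single token and that queries are matched as token prefixes. The toy model below is not PostgreSQL's parser; `toyTokens` and `toyPrefixSearch` are hypothetical names used only to illustrate that reading.

```typescript
// Toy model only: this is NOT how PostgreSQL tokenizes text. It assumes an
// unbroken run of CJK characters becomes one token and queries are prefix-matched.
function toyTokens(text: string): string[] {
  // Split on whitespace and common ASCII/CJK punctuation.
  return text.split(/[\s,.!?;:，。！？；：、]+/u).filter((t) => t.length > 0);
}

function toyPrefixSearch(text: string, query: string): boolean {
  // A document matches when any token starts with the query.
  return toyTokens(text).some((token) => token.startsWith(query));
}

const doc = "这是一个测试文档";
console.log(toyPrefixSearch(doc, "这是")); // true: the query is a prefix of the single token
console.log(toyPrefixSearch(doc, "测试")); // false: the query sits in the middle of the token
console.log(doc.includes("测试")); // true: plain substring matching finds it
```

Under this model, only a word that starts a token (in practice, the start of a sentence) can match, which is consistent with the behavior described above.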

YuJianghao avatar Mar 07 '22 03:03 YuJianghao

See previous discussion: https://github.com/outline/outline/issues/826

tommoor avatar Mar 07 '22 05:03 tommoor

I guess this is because PostgreSQL's Chinese text segmentation is imperfect.

You can replace the following at server/models/Document.js:601

    // Build the SQL query to get documentIds, ranking, and search term context
    const whereClause = `
  "searchVector" @@ to_tsquery('english', :query) AND

with

  const keywords = `${escape("%" + query + "%")}`;
  // Build the SQL query to get documentIds, ranking, and search term context
  const whereClause = `
    (text LIKE ${keywords} OR title LIKE ${keywords}) AND

However, I think it would be better to add a setting that lets users choose between the more efficient PostgreSQL full-text search and the less efficient SQL LIKE.

But since I only write PHP, I'm still figuring out how to contribute this change lol
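
A minimal sketch of such a setting, not taken from the Outline codebase: `SearchMode`, `buildSearchClause`, and `escapeLikePattern` are hypothetical names, and the sketch assumes the query runner supports named placeholders such as `:query`, as the original snippet suggests.

```typescript
// Hypothetical setting; Outline does not define a search mode like this.
type SearchMode = "fulltext" | "like";

interface SearchClause {
  where: string; // SQL fragment containing named placeholders
  replacements: Record<string, string>; // values to bind to those placeholders
}

function escapeLikePattern(term: string): string {
  // Escape LIKE wildcards so user input is matched literally
  // (backslash is PostgreSQL's default LIKE escape character).
  return term.replace(/[\\%_]/g, (ch) => "\\" + ch);
}

function buildSearchClause(mode: SearchMode, query: string): SearchClause {
  if (mode === "like") {
    // Substring match: slower, but independent of text segmentation.
    return {
      where: `(text LIKE :pattern OR title LIKE :pattern)`,
      replacements: { pattern: `%${escapeLikePattern(query)}%` },
    };
  }
  // Default: the existing full-text search clause.
  return {
    where: `"searchVector" @@ to_tsquery('english', :query)`,
    replacements: { query },
  };
}

console.log(buildSearchClause("like", "测试"));
```

Binding the pattern as a placeholder rather than interpolating it keeps the two modes interchangeable for the caller, which only has to append the returned fragment and pass the replacements along.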

ckmarkhsu avatar Apr 18 '22 05:04 ckmarkhsu

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Aug 17 '22 02:08 github-actions[bot]

I fixed this problem, though not perfectly:

  1. Install the word segmentation plugin zhparser. For me, this just meant replacing the database Docker image with abcfy2/zhparser:13
  2. Connect to the database and execute this script:
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION chinese_zh (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION chinese_zh ADD MAPPING FOR n,v,a,i,e,l WITH simple;


CREATE OR REPLACE FUNCTION public.atlases_search_trigger()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
begin
  new."searchVector" :=
    setweight(to_tsvector('chinese_zh', coalesce(new.name, '')),'A') ||
    setweight(to_tsvector('chinese_zh', coalesce(new.description, '')), 'C');
  return new;
end
$function$
;

CREATE OR REPLACE FUNCTION public.documents_search_trigger()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
    begin
      new."searchVector" :=
        setweight(to_tsvector('chinese_zh', coalesce(new.title, '')),'A') ||
        setweight(to_tsvector('chinese_zh', coalesce(array_to_string(new."previousTitles", ' , '),'')),'C') ||
        setweight(to_tsvector('chinese_zh', coalesce(new.text, '')), 'B');
      return new;
    end
    $function$
;

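-- Backfill existing rows using the same weights as documents_search_trigger above
UPDATE documents
SET "searchVector" =
  setweight(to_tsvector('chinese_zh', coalesce(title, '')), 'A') ||
  setweight(to_tsvector('chinese_zh', coalesce(array_to_string("previousTitles", ' , '), '')), 'C') ||
  setweight(to_tsvector('chinese_zh', coalesce(text, '')), 'B');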

Searching for a single Chinese word should now mostly work, but searching for a phrase or sentence does not; that still needs some code changes.
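
One possible direction for the phrase case, untested against Outline: PostgreSQL's `plainto_tsquery` runs the input through the given configuration's parser and ANDs the resulting tokens, so a multi-word query does not need `to_tsquery` operator syntax. The helper name `buildZhWhereClause` is hypothetical; `chinese_zh` is the configuration created in the script above.

```typescript
// Hypothetical helper; the default config name matches the chinese_zh
// text search configuration created in the SQL script above.
function buildZhWhereClause(config: string = "chinese_zh"): string {
  // Only allow simple identifiers, since the config name is inlined into the SQL text.
  if (!/^[a-z_][a-z0-9_]*$/i.test(config)) {
    throw new Error(`Invalid text search configuration name: ${config}`);
  }
  // plainto_tsquery parses :query with the given configuration and ANDs the tokens.
  return `"searchVector" @@ plainto_tsquery('${config}', :query)`;
}

console.log(buildZhWhereClause());
```

Whether this gives good results for whole sentences depends on how zhparser segments them, which this sketch does not verify.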

zhpjy avatar Aug 17 '22 14:08 zhpjy

The LIKE-based replacement from @ckmarkhsu's comment works fine.

firer1946 avatar Oct 08 '22 09:10 firer1946

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Feb 06 '23 01:02 github-actions[bot]

To continue from @RickCogley's comment in https://github.com/outline/outline/issues/826#issuecomment-748992000, Meilisearch have released a stable v1.0.

almereyda avatar Feb 10 '23 15:02 almereyda

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Jun 11 '23 02:06 github-actions[bot]

This issue is far from solved, as I think it's strongly coupled with @tommoor's intuition in #1250.

Anyway, document search is still extremely buggy. An option like the one @almereyda proposed, implementing Meilisearch, seems a valid way forward, or, in a maximalist approach (not needed, in my view), the AI search proposed in #5337.

In any case, search problems should be properly addressed rather than left to go stale until they are closed.

matbgn avatar Jun 11 '23 09:06 matbgn

This issue is about CJK support; if you have found other specific bugs, I'd recommend filing them separately. Please remember that I am under no obligation to work in public and provide this project for free – work gets done at a pace that I can take on.

tommoor avatar Jun 11 '23 13:06 tommoor

I was mainly responding to the bot's intention to close this issue. Still, this issue remains valid: searching for sentences with French diacritics behaves the same way as described in the comment of 17 August '22.

The goal is also to avoid opening many similar (in this case nearly identical) bugs, which would add to your workload and force end users to fight multiple stale-bot instances.

Side note: I personally really appreciate and highly value your work, so I meant no offense; sorry if it came across that way. I'm afraid my English is a bit lacking.

matbgn avatar Jun 11 '23 15:06 matbgn

After some research, I found a solution for me:

  1. Use a PGroonga image as the database. It adds fast full-text search to Postgres and is based on the official Postgres Docker image, so the parameters and config are fully compatible and you can swap it in without changing anything. I use the groonga/pgroonga:latest-debian-14 tag; you can choose yours from here
  2. Create the indexes:
CREATE EXTENSION IF NOT EXISTS pgroonga;
CREATE INDEX text_pgroonga_index ON documents USING pgroonga (text);
CREATE INDEX title_pgroonga_index
ON documents
USING pgroonga (title pgroonga_varchar_full_text_search_ops_v2);
  3. Change the code, inspired by @ckmarkhsu: in server/models/helpers/SearchHelper.ts:317, replace
    // Build the SQL query to get documentIds, ranking, and search term context
    const whereClause = `
  "searchVector" @@ to_tsquery('english', :query) AND

with

    // NOTE: query is interpolated into the SQL text unescaped; escape or bind it before real use
    const keywords = `${"'" + query + "'"}`;
    // Build the SQL query to get documentIds, ranking, and search term context
    const whereClause = `
    (text &@~ ${keywords} OR title &@~ ${keywords}) AND

Here is my fork, and docker build

You can cherry-pick the cjk branch and build it yourself, or just use my image.
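
A hedged variant of the clause above that binds the search term instead of quoting it by hand, so an apostrophe in the query cannot break the SQL. `buildPgroongaClause` is a hypothetical name, and the sketch assumes the query runner accepts the named placeholder `:query` next to the `&@~` operator, which has not been verified against the fork.

```typescript
interface BoundClause {
  where: string; // SQL fragment containing the :query placeholder
  replacements: { query: string }; // value bound to :query by the query runner
}

// Hypothetical helper: same PGroonga operator as the snippet above,
// but the user input travels as a bound value, not as SQL text.
function buildPgroongaClause(query: string): BoundClause {
  return {
    // &@~ is PGroonga's full-text search operator using query syntax.
    where: `(text &@~ :query OR title &@~ :query)`,
    replacements: { query },
  };
}

console.log(buildPgroongaClause("it's 测试"));
```

The fragment stays constant regardless of the input, which also makes it easier to log and review.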

danfate avatar Jul 10 '23 16:07 danfate

Wow, impressive!

@tommoor Do you think something like this could be done with the current architecture?

Is a "good first PR" just around the corner? If not, how can we help get it properly defined, e.g. with a clear breakdown of the project's steps?

matbgn avatar Jul 14 '23 20:07 matbgn

Unfortunately it doesn't feel maintainable to support multiple types of database, but I'm glad the open nature allows for this kind of fork.

tommoor avatar Jul 14 '23 22:07 tommoor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Nov 12 '23 01:11 github-actions[bot]

Still relevant. Search enhancement is on the plate (possibly mine...)

matbgn avatar Nov 12 '23 08:11 matbgn

I received a notification about this PR: https://github.com/hakimel/reveal.js/pull/3532/files

I was wondering whether something dead simple like this could be implemented in Outline on the frontend, before the request goes to the DB, to solve the diacritics and perhaps also the sentence-search problems. 🤔
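
For the diacritics part, a common client-side technique is to decompose the query with Unicode normalization and drop the Latin combining marks before sending it. The sketch below is a generic illustration, not a description of the linked PR or of Outline's frontend; `normalizeQuery` is a hypothetical name. It would only help if the indexed text were normalized the same way.

```typescript
// Hypothetical frontend helper: make a query diacritic-insensitive.
function normalizeQuery(input: string): string {
  return input
    .normalize("NFD") // split "é" into "e" + U+0301
    .replace(/[\u0300-\u036f]/g, "") // drop only the Latin combining diacritical marks
    .normalize("NFC"); // recompose everything else (e.g. kana, Hangul)
}

console.log(normalizeQuery("élève")); // "eleve"
console.log(normalizeQuery("这是一个测试文档")); // unchanged
```

The range is limited to U+0300 through U+036F on purpose: stripping every combining mark would also remove Japanese voicing marks, which change the meaning of a word.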

matbgn avatar Dec 18 '23 09:12 matbgn

The problem is in the way that postgres indexes these characters I'm afraid. I expect #5337 to be a nice workaround as it avoids the postgres index entirely

tommoor avatar Dec 18 '23 13:12 tommoor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions[bot] avatar Apr 17 '24 01:04 github-actions[bot]

Automatically closed due to inactivity

github-actions[bot] avatar Apr 22 '24 01:04 github-actions[bot]

Five days after the warning is just too short for the community to respond... We are not working full-time on keeping alive bugs that should stay open.

Sad 😔

Would it be possible to change the stale setting to 60/30 instead of 90/5, @tommoor?

matbgn avatar Apr 22 '24 07:04 matbgn