outline
outline copied to clipboard
Search should support CJK
When searching for CJK word, only words at the very beginning of each sentences can be searched.
To Reproduce Steps to reproduce the behavior:
- Create a document
- Add content
这是一个测试文档 - Save and publish
- Search for
测试 - The document isn't in the search result
Expected behavior
The document with 这是一个测试文档 should be shown in the search result
Outline (please complete the following information):
- Install: self hosted
- Version: docker: outlinewiki/outline:0.62.0
See previous discussion: https://github.com/outline/outline/issues/826
I guess this is due to the fact that postgreSQL's Chinese text segmentation system is not very perfect.
you can replace server/models/Document.js:601
// Build the SQL query to get documentIds, ranking, and search term context
const whereClause = `
"searchVector" @@ to_tsquery('english', :query) AND
with
const keywords = `${escape("%" + query + "%")}`;
// Build the SQL query to get documentIds, ranking, and search term context
const whereClause = `
(text LIKE ${keywords} OR title LIKE ${keywords}) AND
However, I think it is better to let users decide whether to use the more efficient postgreSQL full-text search or the less efficient SQL LIKE by setting it up.
But since I only write PHP, I'm still figuring out how to contribute to this modification lol
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days
I fixed this problem not perfectly
- Install word segmentation plugin
zhparser. For me just replace the database docker image toabcfy2/zhparser:13 - Connect database and execute script
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION chinese_zh (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION chinese_zh ADD MAPPING FOR n,v,a,i,e,l WITH simple;
CREATE OR REPLACE FUNCTION public.atlases_search_trigger()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
begin
new."searchVector" :=
setweight(to_tsvector('chinese_zh', coalesce(new.name, '')),'A') ||
setweight(to_tsvector('chinese_zh', coalesce(new.description, '')), 'C');
return new;
end
$function$
;
CREATE OR REPLACE FUNCTION public.documents_search_trigger()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
begin
new."searchVector" :=
setweight(to_tsvector('chinese_zh', coalesce(new.title, '')),'A') ||
setweight(to_tsvector('chinese_zh', coalesce(array_to_string(new."previousTitles", ' , '),'')),'C') ||
setweight(to_tsvector('chinese_zh', coalesce(new.text, '')), 'B');
return new;
end
$function$
;
update documents
set "searchVector"=setweight(to_tsvector('chinese_zh', coalesce("text", '')),'A')||
setweight(to_tsvector('chinese_zh', coalesce("text", '')),'C');
Now search a chinese word should be mostly ok, but not work for search a phrase or sentence. It need some code.
I guess this is due to the fact that postgreSQL's Chinese text segmentation system is not very perfect.
you can replace server/models/Document.js:601
// Build the SQL query to get documentIds, ranking, and search term context const whereClause = ` "searchVector" @@ to_tsquery('english', :query) ANDwith
const keywords = `${escape("%" + query + "%")}`; // Build the SQL query to get documentIds, ranking, and search term context const whereClause = ` (text LIKE ${keywords} OR title LIKE ${keywords}) ANDHowever, I think it is better to let users decide whether to use the more efficient postgreSQL full-text search or the less efficient SQL LIKE by setting it up.
But since I only write PHP, I'm still figuring out how to contribute to this modification lol
it works fine.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days
To continue from @RickCogley's comment in https://github.com/outline/outline/issues/826#issuecomment-748992000, Meilisearch have released a stable v1.0.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days
This issue is far from solved, as I think it's hardly coupled with the intuition of @tommoor here #1250
Anyway, searching documents is still extremely buggy and an option like @almereyda proposed by implementing Meilisearch seems to be valid way to follow, or in a maximalist (not needed in my perspective) the AI search proposed here #5337
In any case, search problems should be properly addressed and not just let them be stale to earn the close status.
This issue is regarding CJK support, if you have found other specific bugs I'd recommend filing them separately. Please remember that I am under no obligation to work in public and provide this project for free – work gets done at a pace that I can take on.
I've mainly responded to the bot about its willingness to close this issue, but still this issue has sense, as the search of sentences with diacritics in French behave the same as described here in the comment of 17 August 22.
The goal is also not to open many similar (almost clone in this case) bugs as it will load your desk and it will mean fighting with multiple bot stale instances for the end users.
Side note: I personally really appreciate and hardly value your work, so no mean to offend you, sorry if that was the case. I'm afraid my English is a bit lacking.
After some research, I found a solution for me:
- Use pgroonga image as database. It adds fast full text search for postgres and based on postgres official docker image, so the parameters/config are totally compatible, you can easily replace it without change anything. for me, i use
groonga/pgroonga:latest-debian-14tag; you can choose yours from here - create index:
CREATE EXTENSION IF NOT EXISTS pgroonga;
CREATE INDEX text_pgroonga_index ON documents USING pgroonga (text);
CREATE INDEX title_pgroonga_index
ON documents
USING pgroonga (title pgroonga_varchar_full_text_search_ops_v2);
- change code, inspired by @ckmarkhsu
server/models/helpers/SearchHelper.ts:317
// Build the SQL query to get documentIds, ranking, and search term context
const whereClause = `
"searchVector" @@ to_tsquery('english', :query) AND
with
const keywords = `${"'" + query + "'"}`;
// Build the SQL query to get documentIds, ranking, and search term context
const whereClause = `
(text &@~ ${keywords} OR title &@~ ${keywords}) AND
Here is my fork, and docker build
You can cherry pick the cjk branch and build by yourself or just use my image.
Wow, impressive!
@tommoor Do you think something like this could be done with actual architecture?
Is a "good first PR" just around the corner? And if not, how can we help to make it correctly defined? E.g. with a good definition of project's steps?
Unfortunately it doesn't feel maintainable to support multiple types of database, but I'm glad the open nature allows for this kind of fork.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days
Still relevant. Search enhancement is on the plate (possibly mine...)
As I received a message about this PR https://github.com/hakimel/reveal.js/pull/3532/files
I was thinking if something stupid simple like this could be implemented for Outline, on the frontend, before requesting to the DB to solve diacritics and maybe also sentence searching problems. 🤔
The problem is in the way that postgres indexes these characters I'm afraid. I expect #5337 to be a nice workaround as it avoids the postgres index entirely
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days
Automatically closed due to inactivity
5 days after warning is just too short for the community to respond... We are not working full-time to maintain bug that should be kept open.
Sad 😔
Could it be possible to balance the stale setting to 60/30 instead of 90/5 @tommoor ?