How to support search in other languages, such as Chinese?

Open sinianzhiren opened this issue 3 years ago • 7 comments

Excuse me, how can I support search in other languages, such as Chinese? Thank you.

sinianzhiren avatar Feb 01 '23 09:02 sinianzhiren

Hello @sinianzhiren, in principle, MiniSearch should be able to work with any language. In practice, it might be necessary to tweak some options in some cases, but I think the defaults should be a good starting point.

Unfortunately I do not know enough about the Chinese language to guide you here, but other users have successfully used MiniSearch for Chinese (look for example at this comment or at this issue).

Did you encounter a specific problem supporting Chinese or other languages? If so, you can describe it, and I would be happy to help if I can.

lucaong avatar Feb 01 '23 09:02 lucaong

Excuse me, how can I support search in other languages, such as Chinese? Thank you.

You should run Chinese word segmentation with a library such as nodejieba before indexing the documents.
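
A minimal sketch of that approach, assuming a `cut` function that maps a string to an array of words (nodejieba's `cut` should have this shape, but check its documentation). `presegmentDocuments` and the character-level `charCut` stand-in are hypothetical names used for illustration. The idea is to join the segmented words with spaces so that MiniSearch's default tokenizer, which splits on whitespace and punctuation, sees one word per token.

```javascript
// Hypothetical helper: rewrites the given fields of each document so that
// words are separated by spaces before the documents are indexed.
function presegmentDocuments(documents, fields, cut) {
  return documents.map((doc) => {
    const copy = { ...doc };
    for (const field of fields) {
      if (typeof copy[field] === "string") {
        copy[field] = cut(copy[field]).join(" ");
      }
    }
    return copy;
  });
}

// Stand-in segmenter for demonstration only: one token per character.
// In a real project, pass a proper word segmenter here instead.
const charCut = (text) => Array.from(text).filter((ch) => ch.trim() !== "");

const prepared = presegmentDocuments(
  [{ id: 1, text: "表单验证" }],
  ["text"],
  charCut
);
console.log(prepared[0].text); // "表 单 验 证"
```

The same `cut` function would also need to be applied to search queries, otherwise an unsegmented query will not match the segmented index.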

SSShooter avatar Feb 07 '23 08:02 SSShooter

If you don't care about supporting Firefox, Intl.Segmenter is great.
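
For reference, a small sketch of what Intl.Segmenter does at word granularity. `segmentWords` is a hypothetical helper name, and the whitespace fallback for runtimes without Intl.Segmenter is an assumption of this sketch. The exact segmentation of Chinese text depends on the runtime's ICU data, so no expected output is shown for it.

```javascript
// Hypothetical helper: returns the word-like segments of a string.
function segmentWords(text, locale = "zh") {
  if (typeof Intl === "undefined" || !Intl.Segmenter) {
    // Fallback (assumption): split on whitespace when Intl.Segmenter is missing.
    return text.split(/\s+/).filter(Boolean);
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const seg of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (seg.isWordLike) words.push(seg.segment);
  }
  return words;
}

console.log(segmentWords("hello world")); // ["hello", "world"]
console.log(segmentWords("为字段添加 required 属性"));
```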

connebs avatar Jul 04 '23 11:07 connebs

Firefox Nightly supports Intl.Segmenter.

Here is an example that uses Intl.Segmenter to support CJK text:

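import MiniSearch from "minisearch";

// Feature-detect Intl.Segmenter, which is not available in every runtime.
const segmenter =
  Intl.Segmenter && new Intl.Segmenter("zh", { granularity: "word" });

const miniSearch = new MiniSearch({
  fields: ["text"],
  // Split each term produced by the default tokenizer into words.
  // When processTerm returns an array, each element is indexed as its own term.
  // Note: this replaces the default processTerm, which lowercases terms.
  processTerm: (term) => {
    if (!segmenter) return term;
    const tokens = [];
    for (const seg of segmenter.segment(term)) {
      tokens.push(seg.segment);
    }
    return tokens;
  },
});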

const documents = [
  { id: 1, text: "为字段添加 required 属性,并在提交时进行表单验证" },
  {
    id: 2,
    text: "By default, the same processing is applied to search queries. In order to apply a different processing to search queries, supply a processTerm search option:",
  },
];

miniSearch.addAll(documents);
console.log("===");
console.log(miniSearch.search("添加"));

Here is an online example: https://duoyun-ui.gemjs.org/zh/ (the search front end uses @docsearch/js).

mantou132 avatar Jan 14 '24 11:01 mantou132

I also ran into a problem searching Chinese text. For example, searching for "预置" returned nothing, because word segmentation split it into "预" and "置". My project uses VitePress, which uses MiniSearch indirectly, and I finally got search working with this configuration:

...
export default defineConfig({
  ...
  themeConfig: {
    search: {
      options: {
        miniSearch: {
          options: {
            tokenize: (term) => {
              if (typeof term === 'string') term = term.toLowerCase();
              // @ts-ignore
              const segmenter = Intl.Segmenter && new Intl.Segmenter("zh", { granularity: "word" });
              if (!segmenter) return [term];
              const tokens = [];
              for (const seg of segmenter.segment(term)) {
                // @ts-ignore
                tokens.push(seg.segment);
              }
              return tokens;
            },
          },
          searchOptions: {
            combineWith: 'AND', // important for searching Chinese
            processTerm: (term) => {
              if (typeof term === 'string') term = term.toLowerCase();
              // @ts-ignore
              const segmenter = Intl.Segmenter && new Intl.Segmenter("zh", { granularity: "word" });
              if (!segmenter) return term;
              const tokens = [];
              for (const seg of segmenter.segment(term)) {
                // @ts-ignore
                tokens.push(seg.segment);
              }
              return tokens;
            },
          },
        },
      },
    },
  },
  ...
});
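
To illustrate why `combineWith: 'AND'` matters here, consider a toy inverted index (this is not MiniSearch's actual implementation; `buildIndex` and `search` are hypothetical names). Once a query such as "预置" is split into "预" and "置", OR matching returns every document containing either character, while AND matching keeps only the documents containing both.

```javascript
// Toy inverted index: maps each token to the set of document ids containing it.
function buildIndex(docs, tokenize) {
  const index = new Map();
  for (const doc of docs) {
    for (const token of tokenize(doc.text)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(doc.id);
    }
  }
  return index;
}

// Combines the per-token result sets with either AND (intersection) or OR (union).
function search(index, queryTokens, combineWith) {
  const sets = queryTokens.map((t) => index.get(t) ?? new Set());
  if (sets.length === 0) return [];
  const [first, ...rest] = sets;
  const result = new Set(first); // copy, so the index itself is not mutated
  for (const s of rest) {
    if (combineWith === "AND") {
      for (const id of result) if (!s.has(id)) result.delete(id);
    } else {
      for (const id of s) result.add(id);
    }
  }
  return [...result].sort((a, b) => a - b);
}

// Character-level tokenizer as a stand-in for a segmenter that splits "预置".
const chars = (text) => Array.from(text);
const index = buildIndex(
  [
    { id: 1, text: "预置" },
    { id: 2, text: "预览" },
    { id: 3, text: "设置" },
  ],
  chars
);

console.log(search(index, chars("预置"), "OR")); // [1, 2, 3]
console.log(search(index, chars("预置"), "AND")); // [1]
```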

Thanks to @mantou132

ThomasChan avatar Jul 15 '24 01:07 ThomasChan