pagefind Improve handling of punctuation in segmented languages

Open bglw opened this issue 2 years ago • 4 comments

In whitespace-delimited languages, when Pagefind encounters a-b it is indexed as ab. In languages that go through segmentation, the same text may first be segmented into 'a', '-', 'b', which is indexed as the words a and b with a word in between. This makes such text hard to search for: on the client side, a-b will still search for ab, and an exact search for "a b" or "a - b" won't match because of the ignored word indexed between a and b.

We don't have the segmentation available on the client due to bandwidth constraints, so a different solution will need to be found. One easy(ish) option would be for the client to search for a-b as ab and a b.
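
As a rough illustration of that client-side option, here is a minimal Python sketch. The helper name `expand_query` and the punctuation rule (anything that is not a letter, digit, or whitespace) are assumptions made for illustration; this is not Pagefind's actual client code.

```python
import re

# Assumption: "punctuation" means any character that is not a letter,
# digit, or whitespace. The underscore is included explicitly because
# \w would otherwise treat it as a word character.
PUNCTUATION = re.compile(r"[^\w\s]|_")


def expand_query(term: str) -> list[str]:
    """Hypothetical helper: return the forms of a term worth searching for."""
    # Form 1: punctuation removed, matching how a-b is indexed as ab.
    joined = PUNCTUATION.sub("", term)
    # Form 2: punctuation replaced by a space, matching the separate
    # words a and b produced by segmentation.
    spaced = " ".join(PUNCTUATION.sub(" ", term).split())

    variants = []
    for candidate in (joined, spaced):
        if candidate and candidate not in variants:
            variants.append(candidate)
    return variants


print(expand_query("a-b"))
print(expand_query("ab"))
```

A term without punctuation yields a single variant, so this would only cost an extra lookup for the terms that need it.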

Jul 12 '23 00:07 bglw

Another option would be to swap in different punctuation logic for segmented languages, so that in these cases a-b is always treated as 'a', '-', 'b'.
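
A minimal sketch of what that rule could look like, assuming a simple regex stands in for the punctuation logic. The `tokenize` function is hypothetical, and a real segmenter would split the word runs further for languages like Chinese.

```python
import re

# Assumption: a token is either a run of letters/digits or one single
# punctuation character. Whitespace only separates tokens.
TOKEN = re.compile(r"[^\W_]+|[^\w\s]|_")


def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer that always emits punctuation as its own token."""
    return TOKEN.findall(text)


print(tokenize("a-b"))
print(tokenize("a - b"))
```

Because the rule needs no dictionary, the same function could run at index time and on the client, so a-b and "a - b" would produce the same token sequence on both sides.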

Jul 12 '23 00:07 bglw

Hi, I think the same problem occurs with the ' symbol, which can be placed in front of a word in French (it's the contraction of the article le, which becomes l'). For example, a search for alphabet should also return results in which l'alphabet can be found. Thanks in advance and congratulations on a very good job!

Sep 07 '23 15:09 hjonin

Hey @hjonin 👋

The French case is slightly different (in a good way) — this current issue only applies to non-whitespace-delimited languages like Chinese.

For indexing l'alphabet, that will be resolved by #225, which will be released in a stable version next week 😄

The current v1.0.0-beta.2 release includes the behavior, though it doesn't yet split on the ' symbol. But you bring up a great point re: French, so I'll expand the logic to cover it (and probably just all punctuation for now, and we can configure it down later if need be).

Here's an example of the new word indexing finding the attribute in html_attribute:
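
The idea can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the code from #225: the `index_terms` helper is hypothetical, and splitting on every punctuation character (including ' and _) is the expanded behavior described above rather than what v1.0.0-beta.2 does.

```python
import re

# Assumption: split on any character that is not a letter, digit, or
# whitespace, plus the underscore.
PUNCTUATION = re.compile(r"[^\w\s]|_")


def index_terms(word: str) -> list[str]:
    """Hypothetical helper: the terms a single word would be indexed under."""
    lowered = word.lower()
    # The whole word with punctuation removed (a-b indexed as ab).
    joined = PUNCTUATION.sub("", lowered)
    # The individual parts on either side of each punctuation character.
    parts = [part for part in PUNCTUATION.split(lowered) if part]

    terms = [joined] if joined else []
    for part in parts:
        if part not in terms:
            terms.append(part)
    return terms


print(index_terms("html_attribute"))
print(index_terms("l'alphabet"))
```

With the parts indexed alongside the joined word, a search for attribute can match html_attribute and a search for alphabet can match l'alphabet.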

Sep 07 '23 20:09 bglw

Hi @bglw

Thank you very much for your answer! Can't wait to use the new version!

Sep 14 '23 08:09 hjonin
