pagefind
Improve handling of punctuation in segmented languages
In whitespace-delimited languages, when Pagefind encounters `a-b` it will be indexed as `ab`. In languages that go through segmentation, this might have been first segmented to 'a', '-', 'b', which indexes as the words `a` and `b` with a word in between. This makes it hard to search for, as on the client side `a-b` will continue to search for `ab`, and an exact search for `"a b"` or `"a - b"` won't match due to the ignored word indexed between `a` and `b`.
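To make the mismatch concrete, here is a minimal Python sketch of the two indexing paths as described above. It is not Pagefind's actual code: the function names and the `(position, word)` representation are assumptions made purely for illustration.

```python
import re

def index_whitespace_delimited(text):
    """Assumed behaviour: split on whitespace, then drop punctuation inside each word."""
    words = (re.sub(r"[^\w]", "", raw) for raw in text.split())
    return [w for w in words if w]

def index_segmented(segments):
    """Assumed behaviour: every segment keeps its position, but
    punctuation-only segments are not indexed as searchable words."""
    return [(pos, seg) for pos, seg in enumerate(segments) if re.search(r"\w", seg)]

print(index_whitespace_delimited("a-b"))  # ['ab']
print(index_segmented(["a", "-", "b"]))   # [(0, 'a'), (2, 'b')]
```

In the segmented case `a` and `b` end up two positions apart, so an exact phrase search that expects them to be adjacent finds nothing, and a search for `ab` finds nothing either.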
We don't have the segmentation available on the client due to bandwidth constraints, so a different solution will need to be found. One easy(ish) option would be for the client to search for `a-b` as `ab` and `a b`.
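A rough sketch of that first option, written in Python for brevity rather than in the client's own JavaScript. The `expand_query` helper is hypothetical and only shows the shape of the idea: each query term containing punctuation is turned into a joined variant and a spaced variant.

```python
import re

# Anything that is neither a word character nor whitespace counts as punctuation here.
PUNCTUATION = re.compile(r"[^\w\s]+")

def expand_query(term):
    """Return the query variants to search for: punctuation removed,
    and punctuation replaced by a space (when that differs)."""
    joined = " ".join(PUNCTUATION.sub("", term).split())
    spaced = " ".join(PUNCTUATION.sub(" ", term).split())
    variants = [joined]
    if spaced != joined:
        variants.append(spaced)
    return variants

print(expand_query("a-b"))       # ['ab', 'a b']
print(expand_query("alphabet"))  # ['alphabet']
```

The client would then run each variant and merge the results, which avoids shipping any segmentation data to the browser.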
Another option would be to swap out for different punctuation logic in segmented languages, so that in these cases `a-b` is always 'a', '-', 'b'.
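The second option could look something like the sketch below, assuming the same rule is applied both when indexing and when parsing the query. The `split_keep_punctuation` name and the regular expression are illustrative assumptions, not Pagefind's implementation.

```python
import re

# A token is either a run of word characters or a single punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")

def split_keep_punctuation(text):
    """Always emit punctuation as its own token instead of stripping it."""
    return TOKEN.findall(text)

print(split_keep_punctuation("a-b"))    # ['a', '-', 'b']
print(split_keep_punctuation("a - b"))  # ['a', '-', 'b']
```

Because both sides produce the same three tokens, a query for `a-b` and the indexed content line up without the client needing the segmenter.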
Hi, I think the same problems occur with the symbol `'` that can be placed in front of a word in French (it's the contraction of the article `le`, which becomes `l'`).
For example, a search for `alphabet` should also return results in which `l'alphabet` can be found.
Thanks in advance and congratulations on a very good job!
Hey @hjonin 👋
The French case is slightly different (in a good way): this current issue only applies to non-whitespace-delimited languages like Chinese.
For indexing `l'alphabet`, that will be resolved by #225 which will be released on a stable version next week 😄
The current `v1.0.0-beta.2` release includes the behavior, though it currently doesn't split on the `'` symbol. But you bring up a great point re: French, so I'll expand the logic to cover it (and probably just all punctuation for now, and we can configure it down later if need be).
Here's an example of the new word indexing finding the `attribute` in `html_attribute`:
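As a rough illustration of splitting on all punctuation, the Python sketch below indexes a word both whole and as its punctuation-separated parts. The `index_terms` helper and its separator rule are assumptions for illustration only; the actual behavior in #225 may differ.

```python
import re

# Split on any non-word character, and on underscores as well.
SEPARATORS = re.compile(r"[\W_]+")

def index_terms(word):
    """Return the whole word plus each punctuation-separated part, without duplicates."""
    terms = [word.lower()]
    for part in SEPARATORS.split(word):
        part = part.lower()
        if part and part not in terms:
            terms.append(part)
    return terms

print(index_terms("html_attribute"))  # ['html_attribute', 'html', 'attribute']
print(index_terms("l'alphabet"))      # ["l'alphabet", 'l', 'alphabet']
```

With terms indexed this way, a search for `attribute` or `alphabet` can match the compound forms as well.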
Hi @bglw
Thank you very much for your answer! Can't wait to use the new version!