Support for Soft Hyphens, a first step to better indexing for languages like German

Open dirk68-fu opened this issue 1 year ago • 1 comments

In the following thread, the problem is discussed that German words are often compounds of simpler words. For example, „Hochspannungsnetzgerät“ is composed of „Hochspannung“ (high voltage) and „Netzgerät“ (power supply). If I search for „Netzgerät“ with Pagefind, it currently does not find „Hochspannungsnetzgerät“, even though the term is contained in it and, semantically, it most certainly is a kind of „Netzgerät“, so the user would expect to find it.

There seems to be no easy solution to that problem. But a first step would be to add optional support for soft hyphen characters (U+00AD, `&shy;` in HTML). Pagefind should treat the soft hyphen as a word boundary. This would enable the generators of the static HTML to include these hints for Pagefind in the page.
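
To illustrate the generator side of this proposal, here is a minimal sketch of how a static site generator might insert soft hyphens at compound boundaries before writing the HTML. The lookup table `COMPOUND_PARTS` and the function `add_soft_hyphens` are hypothetical names made up for this example; a real generator would more likely take the split points from a hyphenation dictionary or a compound splitter.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN, written as &shy; in HTML

# Hypothetical lookup table maintained by the site generator:
# each compound maps to the parts it should be split into.
COMPOUND_PARTS = {
    "Hochspannungsnetzgerät": ["Hochspannungs", "netzgerät"],
}


def add_soft_hyphens(text, compound_parts=COMPOUND_PARTS):
    """Insert a soft hyphen at each known compound boundary in text."""
    for word, parts in compound_parts.items():
        text = text.replace(word, SOFT_HYPHEN.join(parts))
    return text


if __name__ == "__main__":
    marked = add_soft_hyphens("Ein Hochspannungsnetzgerät")
    # repr() makes the otherwise invisible U+00AD visible as \xad
    print(repr(marked))
```

Browsers render the soft hyphen only when they break the line at that position, so the visible text is unchanged while the indexer gets a hint about where the word can be split.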

dirk68-fu avatar Nov 14 '24 10:11 dirk68-fu

👋

This can be added, and no need for it to be optional. Pagefind already indexes multiple words for a given location when required — e.g. when it encounters a word source_text it will index source_text, source, and text at the given location. I can add the soft hyphen to this list which will roll it into the same handling.
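
As a rough illustration of the behaviour described here (not Pagefind's actual implementation, which is not shown in this thread), the following sketch expands one word into the terms indexed at its location. The function name `index_terms`, the lowercasing step, and the choice to strip soft hyphens from the whole-word term are assumptions made for the example.

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN

# Characters treated as in-word boundaries in this sketch (an assumption):
# the underscore from the source_text example, plus the soft hyphen.
BOUNDARY = re.compile(f"[_{SOFT_HYPHEN}]")


def index_terms(word):
    """Return the whole word followed by its sub-words, without duplicates."""
    # The soft hyphen is invisible, so drop it from the whole-word term.
    whole = word.replace(SOFT_HYPHEN, "").lower()
    terms = [whole]
    for part in BOUNDARY.split(word):
        part = part.lower()
        if part and part not in terms:
            terms.append(part)
    return terms


if __name__ == "__main__":
    print(index_terms("source_text"))
    print(index_terms("Hochspannungs\u00adnetzgerät"))
```

With this handling, a search for „Netzgerät“ would match a page containing „Hochspannungs&shy;netzgerät“, because `netzgerät` is indexed at the same location as the full compound.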

bglw avatar Nov 20 '24 19:11 bglw