Segmentation in browser possible for specialized languages with Intl.Segmenter

Open jonex2 opened this issue 11 months ago • 1 comments

The Pagefind docs mention, at the end, that for specialized languages like Chinese or Japanese, Pagefind is not able to segment text into words in the browser.

With the JS API Intl.Segmenter, this is possible in all major browsers; see MDN.

So the example from the Pagefind docs works with it:

const segmenterZh = new Intl.Segmenter("zh", { granularity: "word" });
const string1 = "每個月都";

const iterator1 = segmenterZh.segment(string1)[Symbol.iterator]();

console.log(iterator1.next().value.segment);
// output: '每個'

console.log(iterator1.next().value.segment);
// output: '月'

console.log(iterator1.next().value.segment);
// output: '都'
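
To make the idea more concrete, here is a minimal sketch of how a search query could be pre-segmented in the browser before it is handed to the search. The helper name `segmentWords` and the whitespace fallback are my own assumptions for illustration; nothing here is part of Pagefind itself.

```javascript
// Hypothetical helper (not a Pagefind API): split text into word-like segments.
function segmentWords(text, locale) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    // Assumed fallback for environments without Intl.Segmenter:
    // split on whitespace only.
    return text.split(/\s+/).filter(Boolean);
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}

// Join the segments with spaces so each word becomes a separate search term.
const query = segmentWords("每個月都", "zh").join(" ");
```

The resulting `query` string could then be passed to the search instead of the raw input. Whether the browser's segmentation matches the segmentation used at indexing time would still need to be checked.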

jonex2 · May 22 '25 11:05

Hello! 👋

Yes, I've been following this one (with excitement!), but haven't found the time to get it all plugged in yet. Thanks for opening an issue for it!

bglw · May 22 '25 21:05