
Wrong segmentation in Japanese

Open · hamano opened this issue 2 years ago · 7 comments

Thank you for the release of v1.1.0. I was looking forward to the ranking customization feature. However, it does not seem to work in Japanese. When I set the ranking options and perform a search on content with lang="ja", the following error occurs and the ranking options are not applied.

Uncaught (in promise) TypeError: wasm.set_ranking_weights is not a function
    at __exports.set_ranking_weights (pagefind.js:1:2087)
    at PagefindInstance.set_ranking (pagefind.js:1:19537)
    at async PagefindInstance.init (pagefind.js:1:20135)
    at async Pagefind.init (pagefind.js:9:1384)

No error is output when --force-language en is specified, and the ranking behaves as expected. Any advice would be greatly appreciated. Kind regards.

hamano · Apr 04 '24

👋 This isn't an error I would expect to see, due to the way these WebAssembly modules are bundled, and the location of that function.

My first guess would be that you have the prior version of the Japanese WebAssembly cached. Could you try the following?

  • Delete your output directory (e.g. remove the /pagefind/ directory from your site
  • Re-run the latest Pagefind
  • Load in your browser and hard-refresh
  • Alt: Load in your browser in private or incognito

Let me know if you're still seeing the issue after those steps.

bglw · Apr 04 '24

Thank you for your response. As you pointed out, the error was resolved with a hard reload of the browser. However, the issue with the ranking persists, so I would appreciate any advice you can provide.

Creating an index for English provides the expected ranking.

$ pagefind_extended --force-language en
  1. Pages that are longer and contain more keywords. score: 25.417072, words: (29) [1, 3, 13, 16, 20, 24, 33, 92, 101, 113, 118, 133, 158, 160, 232, 275, 354, 363, 374, 378, 393, 401, 409, 415, 419, 421, 423, 425, 427], word_count: 428
  2. Pages that are shorter and contain fewer keywords. score: 19.279419, words: (13) [0, 7, 11, 24, 25, 30, 36, 40, 45, 46, 47, 51, 53], word_count: 55

However, when creating an index for Japanese, the order does not match the expected one.

$ pagefind_extended --force-language ja
  1. Pages that are shorter and contain fewer keywords. score: 7.1849093, words: (8) [287, 290, 312, 331, 351, 405, 413, 419], word_count: 469
  2. Pages that are longer and contain more keywords. score: 2.3622553, words: [1633], word_count: 2485

Since the number of word hits is noticeably low, this might not be an issue with ranking customization, but rather with the Japanese word segmentation. Is there a way to debug the results of word segmentation in detail? Any advice on this would be greatly appreciated.

hamano · Apr 05 '24

Interesting! If you have a test page to share I'm happy to help look into it :)

Is there a way to debug the results of word segmentation in detail?

Currently you can look at the zero-width space characters in the raw_content field returned with the Pagefind fragment. For extended languages, Pagefind doesn't split on standard whitespace, and instead splits on these zero-width spaces that it inserts after segmentation.

For a quick example, you can replace the \u200B zero-width space characters with a visible marker and log the result, e.g.:

result.raw_content.replace(/\u200B/g, '🍕')

Which will output something like (testing on https://starlight.astro.build/ja/):

Starlight๐Ÿ•ใ‚ทใƒงใƒผใ‚ฑใƒผใ‚น๐Ÿ•. ๐Ÿ•่‡ชๅˆ†๐Ÿ•ใฎ๐Ÿ•ใ‚‚ใฎ๐Ÿ•ใ‚’๐Ÿ•่ฟฝๅŠ ๐Ÿ•ใ—ใ‚ˆ๐Ÿ•ใ†๐Ÿ•๏ผ ๐Ÿ•Starlight๐Ÿ•ใง๐Ÿ•ใ‚ตใ‚คใƒˆ๐Ÿ•ใ‚’๐Ÿ•ไฝœๆˆ๐Ÿ•ใ—๐Ÿ•ใพใ—๐Ÿ•ใŸ๐Ÿ•ใ‹๐Ÿ•๏ผŸ๐Ÿ•ใ“ใฎ๐Ÿ•ใƒšใƒผใ‚ธ๐Ÿ•ใซ๐Ÿ•ใƒชใƒณใ‚ฏ๐Ÿ•ใ‚’๐Ÿ•่ฟฝๅŠ ๐Ÿ•ใ™ใ‚‹๐Ÿ•PR๐Ÿ•ใ‚’๐Ÿ•ไฝœๆˆ๐Ÿ•ใ—๐Ÿ•ใพใ—ใ‚‡๐Ÿ•ใ†๐Ÿ•๏ผ ๐Ÿ•ใ‚ตใ‚คใƒˆ๐Ÿ•. ๐Ÿ•Starlight๐Ÿ•ใฏ๐Ÿ•ใ™ใงใซ๐Ÿ•ๆœฌ็•ช๐Ÿ•็’ฐๅขƒ๐Ÿ•ใง๐Ÿ•ไฝฟ็”จ๐Ÿ•ใ•๐Ÿ•ใ‚Œ๐Ÿ•ใฆ๐Ÿ•ใ„๐Ÿ•ใพใ™๐Ÿ•ใ€‚๐Ÿ•ไปฅไธ‹๐Ÿ•ใฏ๐Ÿ•ใ€๐Ÿ•ใ‚ฆใ‚งใƒ–๐Ÿ•ไธŠ๐Ÿ•ใฎ๐Ÿ•ใ„ใใค๐Ÿ•ใ‹๐Ÿ•ใฎ๐Ÿ•ใ‚ตใ‚คใƒˆ๐Ÿ•ใงใ™๐Ÿ•ใ€‚ ๐Ÿ•Athena ๐Ÿ•OS๐Ÿ•. ๐Ÿ•PubIndexAPI ๐Ÿ•Docs๐Ÿ•. ๐Ÿ•pls๐Ÿ•. ๐Ÿ•capo.js๐Ÿ•. ๐Ÿ•Web ๐Ÿ•Monetization ๐Ÿ•API๐Ÿ•. ๐Ÿ•QBCore ๐Ÿ•Docs๐Ÿ•. ๐Ÿ•har.fyi๐Ÿ•. ๐Ÿ•xs๐Ÿ•-๐Ÿ•dev ๐Ÿ•docs๐Ÿ•. ๐Ÿ•Felicity๐Ÿ•. ๐Ÿ•NgxEditor๐Ÿ•. ๐Ÿ•Astro ๐Ÿ•Error ๐Ÿ•Pages๐Ÿ•. ๐Ÿ•Terrateam ๐Ÿ•Docs๐Ÿ•. ๐Ÿ•simple๐Ÿ•-๐Ÿ•fm๐Ÿ•. ๐Ÿ•Obytes ๐Ÿ•Starter๐Ÿ•. ๐Ÿ•Kanri๐Ÿ•. ๐Ÿ•VRCFR ๐Ÿ•Creator๐Ÿ•. ๐Ÿ•Refact๐Ÿ•. ๐Ÿ•Some ๐Ÿ•drops ๐Ÿ•of ๐Ÿ•PHP ๐Ÿ•Book๐Ÿ•. ๐Ÿ•Nostalgist.js๐Ÿ•. ๐Ÿ•AI ๐Ÿ•Prompt ๐Ÿ•Snippets๐Ÿ•. ๐Ÿ•Folks ๐Ÿ•Router๐Ÿ•. ๐Ÿ•React ๐Ÿ•Awesome ๐Ÿ•Reveal๐Ÿ•. ๐Ÿ•Ethereum ๐Ÿ•Follow ๐Ÿ•Protocol๐Ÿ•. ๐Ÿ•Knip๐Ÿ•. ๐Ÿ•secco๐Ÿ•. ๐Ÿ•SiteOne ๐Ÿ•Crawler๐Ÿ•. ๐Ÿ•csmos๐Ÿ•. ๐Ÿ•TanaFlows ๐Ÿ•Docs๐Ÿ•. ๐Ÿ•Concepto ๐Ÿ•AI๐Ÿ•. ๐Ÿ•Mr๐Ÿ•. ๐Ÿ•Robรธt๐Ÿ•. ๐Ÿ•Open ๐Ÿ•SaaS ๐Ÿ•Docs๐Ÿ•. ๐Ÿ•Astro ๐Ÿ•Snipcart๐Ÿ•. ๐Ÿ•Astro๐Ÿ•-๐Ÿ•GhostCMS๐Ÿ•. ๐Ÿ•oneRepo๐Ÿ•. ๐Ÿ•Flojoy๐Ÿ•. ๐Ÿ•AstroNvim๐Ÿ•. ๐Ÿ•ScreenshotOne ๐Ÿ•Docs๐Ÿ•. ๐Ÿ•DipSway๐Ÿ•. ๐Ÿ•RunsOn๐Ÿ•. ๐Ÿ•SudoVanilla๐Ÿ•. ๐Ÿ•SST ๐Ÿ•Ion๐Ÿ•. ๐Ÿ•Font ๐Ÿ•Awesome๐Ÿ•. ๐Ÿ•Starlight๐Ÿ•ใ‚’๐Ÿ•ไฝฟ็”จ๐Ÿ•ใ—๐Ÿ•ใฆ๐Ÿ•ใ„ใ‚‹๐Ÿ•ใƒ‘ใƒ–ใƒชใƒƒใ‚ฏ๐Ÿ•ใช๐Ÿ•ใƒ—ใƒญใ‚ธใ‚งใ‚ฏใƒˆ๐Ÿ•ใฎ๐Ÿ•GitHub๐Ÿ•ใƒชใƒใ‚ธใƒˆใƒช๐Ÿ•ใ‚’๐Ÿ•็ขบ่ช๐Ÿ•ใ—๐Ÿ•ใฆ๐Ÿ•ใฟ๐Ÿ•ใฆ๐Ÿ•ใใ ใ•ใ„๐Ÿ•ใ€‚

With that you can see how the words were segmented. Note that this only works for the "extended" languages such as ja / zh.
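If a marker-by-eye inspection isn't enough, the same idea can be wrapped into a small helper that turns a raw_content string into a word list. This is just a debugging sketch (the `segmentedWords` name is made up for this example, and whether punctuation tokens survive matches Pagefind's real indexing only approximately):

```javascript
// Sketch: approximate the indexed word list for an "extended language" page.
// Pagefind inserts a U+200B zero-width space after each segment, so splitting
// on those (plus ordinary whitespace) recovers the segment boundaries.
function segmentedWords(rawContent) {
  return rawContent
    .split(/[\u200B\s]+/) // zero-width spaces and regular whitespace
    .filter((word) => word.length > 0);
}

// Hand-made stand-in for result.raw_content:
const raw = "Open\u200BS\u200BS\u200BL \u200BOpen\u200BSsl";
console.log(segmentedWords(raw)); // -> [ 'Open', 'S', 'S', 'L', 'Open', 'Ssl' ]
```

Logging `segmentedWords(result.raw_content)` for a real search result makes it easy to diff the segmentation between two index builds.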

bglw · Apr 09 '24

It seems that there is indeed an issue with the segmentation of Japanese. Here is an example of such content:

<span>OpenSSL</span>
<span>OpenSsl</span>

Creating an index for this content with --force-language en and searching for "ssl" yields the expected 2 hits.

words: (2) [0, 1]
word_count: 2
raw_content: "OpenSSL OpenSsl"

However, with --force-language ja, it is segmented as follows, and only "OpenSsl" hits for "ssl".

words: [5]
word_count: 6
raw_content.replace(/\u200B/g, '|'): "Open|S|S|L |Open|Ssl"

It appears that words like "OpenSSL" are not being correctly segmented.

hamano · Apr 16 '24

@bglw I noticed something odd in the example words you provided.

๐Ÿ•Astro๐Ÿ•-๐Ÿ•GhostCMS๐Ÿ•
<span>Astro-GhostCMS</span>

This content is segmented in my environment as follows:

|Astro|-|Ghost|C|M|S|

What difference between our environments could explain this?

hamano · Apr 17 '24

It seems this is an issue caused by charabia.

main.rs:

use std::env;
use charabia::Segment;

fn main() {
    // Segment the first CLI argument with charabia and join the tokens with "|".
    let arg = env::args().nth(1).expect("usage: segment <text>");
    let segments = arg.as_str().segment_str().collect::<Vec<&str>>().join("|");
    println!("{}", segments);
}
$ cargo run "OpenSSL: Cryptography and SSL/TLS Toolkit"
Open|S|S|L|:| |Cryptography| |and| |SSL|/|TLS| |Toolkit
$ cargo run "OpenSSLは暗号化とSSL/TLSの為のツールキットです。"
Open|S|S|L|は|暗号|化|と|SSL|/|TLS|の|為|の|ツール|キット|です|。

hamano · Apr 18 '24

I've done some digging into charabia. It appears that the latin-camelcase feature is the culprit.

default features

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
Open|S|S|L| |Open|Ssl| |open|Ssl| |open|_|ssl

all features disabled

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl

I think camelCase words in Japanese sentences are usually proper nouns, so there is no reason to split them. Therefore, I propose disabling the default features and enabling only Japanese and Chinese.

only the chinese and japanese features enabled

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl

$ cargo run "OpenSSLは暗号化とSSL/TLSの為のツールキットです。"
OpenSSL|は|暗号|化|と|SSL|/|TLS|の|為|の|ツール|キット|です|。

hamano · Apr 18 '24

Released in v1.1.1 🙂

bglw · Sep 03 '24