charabia latin-camelcase feature make wrong segmentation

The default featre has issue with proper noun segmentation like OpenSSL.

main.rs:

use std::env;
use charabia::Segment;

fn main() {
    let arg = env::args().nth(1).unwrap();
    let segments = arg.as_str().segment_str().collect::<Vec<&str>>().join("|");
    println!("{}", segments)
}

default feature:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
Open|S|S|L| |Open|Ssl| |open|Ssl| |open|_|ssl

disable default feature

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl

May 01 '24 00:05 hamano

Hello @hamano, What do you expect in terms of segmentation?

Thank you!

Jun 03 '24 06:06 ManyTheFish

~~It is ideal that proper nouns are not split, but if it is not in the dictionary, it can't avoided that OpenSSL would be split as Open|SSL.~~ Here, I recognize the issue as splitting it as Open|S|S|L. The word SSL would disappear.

Jun 04 '24 01:06 hamano

The inability to segment words not found in the dictionary was a characteristic in Japanese. Please ignore that. In any case, I believe it is desirable that 'OpenSSL' not be segment, even in Latin languages

Jun 04 '24 01:06 hamano

https://github.com/CloudCannon/pagefind/issues/591

Jun 04 '24 06:06 hamano

The word open_ssl is not common, but for example, in the case of a module like apache mod_ssl, it is expected not to be segmented.

Jun 04 '24 06:06 hamano

Hello @hamano, I think the better way to solve this issue is to add the word SSL to the tokenizer dictionary. It seems to be a specific case more than a generality.

Jun 04 '24 08:06 ManyTheFish

I'm concerned not just about the term "OpenSSL" but about countless similar terms.

OpenVPN
OpenBSD
OpenJDK
OpenCV
FreeRADIUS
FreeBSD
PostgreSQL
MySQL
MongoDB

I don't think all these terms should be added to the dictionary. New terms emerge one after another.

Jun 05 '24 01:06 hamano

If you say this is the expected behavior of the latin-camelcase feature, then that's fine. I'll disable it. However, it's too inconvenient as a default feature, so I reported it, especially for use in technical documentation.

Jun 05 '24 02:06 hamano

Understood, we may disable it from the default features, thinking of it, this segmentation is mainly used for the cases you are describing, so if you want to change the behavior of the segmenter, I would gladly accept a new PR. Moreover, the change seems easy, maybe replacing the highlighted code block by:

        let should_group = if last_char_was_lowercase && char.is_letter_uppercase() {
            false
        } else {
            true
        };

        last_char_was_lowercase = char.is_letter_lowercase();
        should_group

or even

        let should_group = !(last_char_was_lowercase && char.is_letter_uppercase());
        last_char_was_lowercase = char.is_letter_lowercase();
        should_group

Should solve your issue ☺️

However, the word OpenSSLError would be segmented as ["Open", "SSLError"] with this change, it would need a bit more efforts to make it segment like ["Open", "SSL", "Error"]. 🤔

Jun 05 '24 07:06 ManyTheFish

I am unsure how "OpenSSLError" should be segment. Perhaps in the context of a program constant name appearing in documentation, it is expected not to be segment. Anyway, I don't understand the use case for the latin-camelcase feature. So, I will simply disable it. What I wanted to report here is that the excessive segmentation of "OpenSSL" into "Open|S|S|L" is likely not expected by anyone.

Jun 06 '24 01:06 hamano

Hey @hamano, no problem. I just wanted to say that if you want to use the feature, you can create a PR enhancing the behavior and I will accept it. See you!

Jun 06 '24 07:06 ManyTheFish

charabia charabia copied to clipboard

latin-camelcase feature make wrong segmentation

charabia
charabia copied to clipboard