nlpo3 icon indicating copy to clipboard operation
nlpo3 copied to clipboard

Clarify method in Tokenizer trait

Open wingyplus opened this issue 4 years ago • 5 comments
trafficstars

From Tokenizer trait declaration:

pub trait Tokenizer {
    fn segment(&self, text: &str, safe: Option<bool>, parallel: Option<bool>) -> Vec<String>;

    fn segment_to_string(
        &self,
        text: &str,
        safe: Option<bool>,
        parallel: Option<bool>,
    ) -> Vec<String>;
}

It provides 2 methods that look do the same thing. And in Newmm, it do the same things for both methods (Ref here). I purpose to clarify which one we should use and clean up trait protocol.

wingyplus avatar Aug 11 '21 13:08 wingyplus

It should be addressed in #39

Gorlph avatar Aug 11 '21 14:08 Gorlph

I'm still unclear which one we should use between segment_to_string and segment since it just differences only unwrap. :(

wingyplus avatar Aug 13 '21 11:08 wingyplus

Does it confused because both of them return string?

bact avatar Aug 14 '21 06:08 bact

Yes it is.

wingyplus avatar Aug 14 '21 06:08 wingyplus

I suggest renaming them.

segment and segment_to_string, at first, were different - segment used to return an array of utf-8 bytes while segment_to_string returns an array of Rust String for at-the-time overhead problem and testing, respectively.

After the overhead problem had been solved, there was no longer a need to return bytes anymore, so I lazily changed segment to return an array of String.

With new signature change as of #39, I intended to make one return array of String, while the other explicitly return Result - to let developers of bindings handle it.

In the future, IMO, it will be better to remove the method which returns Vec<String> altogether.

The current version abuses the use of panic! very liberally where proper idiomatic error handling should be applied - such as Runtime error (look at newmm_custom and count panic!). I would like to keep panic! at initialization, though.

Gorlph avatar Aug 14 '21 06:08 Gorlph