nlpo3
nlpo3 copied to clipboard
Clarify method in Tokenizer trait
From Tokenizer trait declaration:
pub trait Tokenizer {
fn segment(&self, text: &str, safe: Option<bool>, parallel: Option<bool>) -> Vec<String>;
fn segment_to_string(
&self,
text: &str,
safe: Option<bool>,
parallel: Option<bool>,
) -> Vec<String>;
}
It provides 2 methods that look do the same thing. And in Newmm, it do the same things for both methods (Ref here). I purpose to clarify which one we should use and clean up trait protocol.
It should be addressed in #39
I'm still unclear which one we should use between segment_to_string and segment since it just differences only unwrap. :(
Does it confused because both of them return string?
Yes it is.
I suggest renaming them.
segment and segment_to_string, at first, were different - segment used to return an array of utf-8 bytes while segment_to_string returns an array of Rust String for at-the-time overhead problem and testing, respectively.
After the overhead problem had been solved, there was no longer a need to return bytes anymore, so I lazily changed segment to return an array of String.
With new signature change as of #39, I intended to make one return array of String, while the other explicitly return Result - to let developers of bindings handle it.
In the future, IMO, it will be better to remove the method which returns Vec<String> altogether.
The current version abuses the use of panic! very liberally where proper idiomatic error handling should be applied - such as Runtime error (look at newmm_custom and count panic!). I would like to keep panic! at initialization, though.