segtok
Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.
I have plain German text without any punctuation or sentence-final stops. How can I segment it into sentences and add the stops?
Is it possible that the word tokenizer does not split off the apostrophe and apostrophe-s? E.g. **Toyota's** is considered a _single_ token as opposed to being split into **Toyota** and...
We are seeing a few issues with segtok being over-eager to split quoted sentences with names directly after the quoted section. Ex. "Good morning," said Harry. "Good morning?" asked Harry....
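To make the quoted-sentence case concrete, here is a minimal standard-library sketch of one orthographic rule: treat terminal punctuation (optionally followed by a closing quote) as a candidate boundary, but skip it when the next word starts in lowercase. This is not segtok's implementation; `split_sentences` and `_BOUNDARY` are hypothetical names chosen for illustration.

```python
import re

# Candidate boundary: terminal punctuation, an optional closing quote,
# then whitespace. Hypothetical rule for illustration, not segtok's own.
_BOUNDARY = re.compile(r'([.!?]["\u201d\']?)\s+')


def split_sentences(text):
    """Split at candidate boundaries unless the next word is lowercase."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        following = text[match.end():match.end() + 1]
        # A lowercase continuation (e.g. '?" asked Harry') suggests the
        # quote is embedded in a longer sentence, so do not split here.
        if following.islower():
            continue
        sentences.append(text[start:match.end(1)])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = '"Good morning," said Harry. "Good morning?" asked Harry.'
    for sentence in split_sentences(sample):
        print(sentence)
```

With this rule, `"Good morning?" asked Harry.` stays in one piece because `asked` is lowercase. A capitalized word directly after the closing quote, such as a name, would still trigger a split, so a real segmenter needs more context than this single check.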
For example:
```
from segtok.tokenizer import split_contractions, word_tokenizer
split_contractions(word_tokenizer("OʼHaraʼs"))
```