Word tokenizer does not split off apostrophe or apostrophe-s
Is it possible that the word tokenizer does not split off the apostrophe and apostrophe-s? E.g. "Toyota's" is treated as a single token rather than being split into "Toyota" and "'s".
This has caused me quite a bit of a headache. Would it not be more common to split these?
Hi @pwichmann - have you seen the --split-contractions option here?
https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L344
Or, if you are using this programmatically, the public split_contractions function to post-process tokens, here?
https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L122
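For reference, a minimal sketch of that programmatic route might look like the following (untested; it assumes segtok is installed and that split_contractions accepts a list of token strings and returns a list with contractions split off):

```python
# Minimal sketch: tokenize, then post-process the tokens so forms
# like "Toyota's" are split further. Assumes segtok is importable.
from segtok.tokenizer import word_tokenizer, split_contractions

text = "Toyota's sales rose."
tokens = word_tokenizer(text)        # may keep "Toyota's" as one token
tokens = split_contractions(tokens)  # post-process to split it off
print(tokens)
```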
If you have, can you be specific about what isn't working for you when using that functionality?
I had not seen this. High likelihood of the user (me) being the problem, not the software. Will investigate.