
Word tokenizer does not split apostrophe and apostrophe s

Open · pwichmann opened this issue Aug 01 '19 • 2 comments

Is it possible that the word tokenizer does not split off the apostrophe and apostrophe-s? E.g., Toyota's is treated as a single token rather than being split into Toyota and 's.

This has caused me quite a bit of a headache. Would it not be more common to split these?
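For reference, a minimal sketch reproducing the behaviour described above (the exact output is illustrative, not verified):

```python
from segtok.tokenizer import word_tokenizer

# By default the possessive marker stays attached to the word.
print(word_tokenizer("Toyota's cars sell well."))
# e.g. ["Toyota's", 'cars', 'sell', 'well', '.']
```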

pwichmann avatar Aug 01 '19 13:08 pwichmann

Hi @pwichmann - have you seen the --split-contractions option here? https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L344

Or the public split_contractions function for post-processing tokens, if you are using segtok programmatically? https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L122
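A minimal sketch of that post-processing route (assuming segtok is installed; the exact token boundaries shown are illustrative):

```python
from segtok.tokenizer import word_tokenizer, split_contractions

# Tokenize first, then split possessives and contractions in a second pass.
tokens = split_contractions(word_tokenizer("Toyota's tokenizer doesn't split this."))
print(tokens)
# Expected to yield 'Toyota' and "'s" as separate tokens, e.g.:
# ['Toyota', "'s", 'tokenizer', 'does', "n't", 'split', 'this', '.']
```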

If you have, can you be specific about what isn't working for you when using that functionality?

fnl avatar Aug 01 '19 18:08 fnl

I had not seen this. High likelihood of the user (me) being the problem, not the software. Will investigate.

pwichmann avatar Aug 01 '19 18:08 pwichmann