recasepunc
Words with ' are split on tokenization step
Hello, I have tested the French model and in general it works great.
One issue for me is the tokenization step. Words containing ' are split in two, so l'empire becomes l' and empire, and c'était becomes c' and était. Is that expected behavior, and is there a way to join such words back into one (other than just checking for ')?
Thanks!
We lack a tokenizer that preserves offsets into the source text, which would let us insert punctuation without altering the text. Currently, a set of rules is applied for detokenization, and they don't remove the space after single quotes.
For now, you can apply your own rewriting rules as preprocessing. We hope to be able to do better in the future.
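As a rough sketch of such a rewriting rule, a single regular expression can merge the space that the detokenizer leaves after an apostrophe (the function name and the exact pattern are illustrative, not part of recasepunc):

```python
import re

def merge_apostrophes(text: str) -> str:
    """Rejoin words split at an apostrophe by the detokenizer,
    e.g. "l' empire" -> "l'empire", "c' était" -> "c'était"."""
    # Match a word character followed by ' and trailing whitespace,
    # and drop the whitespace.
    return re.sub(r"(\w')\s+", r"\1", text)

print(merge_apostrophes("l' empire"))   # l'empire
print(merge_apostrophes("c' était"))    # c'était
```

Note this naive rule would also merge cases where a closing single quote legitimately ends a quotation, so it may need refinement depending on your input.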
Sure, thanks for the quick answer