
Words with ' are split at the tokenization step

Open marlon-br opened this issue 3 years ago • 2 comments

Hello, I have tested the French model, and in general it works great.

One issue for me is at the tokenization step. Words containing ' are split in two, so l'empire turns into l' and empire, and c'était turns into c' and était. Is that expected behavior, and what is a way to join such words back into one (other than just checking for ')?

Thanks!

marlon-br avatar Nov 19 '21 12:11 marlon-br

We lack a tokenizer that preserves offsets from the source text, which would let us insert punctuation without altering the text. Currently, a set of rules is applied for detokenization, and they don't remove the space after single quotes.

For now, you can apply your own rewriting rules as preprocessing. We hope to be able to do better in the future.
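As a rough illustration of such a rewriting rule, here is a minimal sketch that removes the space left after a French elided form (l', c', qu', and so on). It is not part of recasepunc: the function name `rejoin_elisions` and the `ELIDED` prefix list are hypothetical, and the sketch assumes the stray space shows up in plain output text that you can rewrite afterwards. Restricting the rule to known elided prefixes avoids gluing a closing single quote onto the following word, at the cost of missing forms not in the list (for example aujourd'hui).

```python
import re

# Hypothetical, non-exhaustive list of French elided prefixes.
ELIDED = ("l", "d", "c", "j", "m", "n", "s", "t",
          "qu", "jusqu", "lorsqu", "puisqu", "quoiqu")

# An elided prefix at a word boundary, a straight or curly apostrophe,
# then whitespace that is immediately followed by a word character.
_ELISION_GAP = re.compile(
    r"\b(" + "|".join(ELIDED) + r")(['’])\s+(?=\w)",
    flags=re.IGNORECASE,
)

def rejoin_elisions(text: str) -> str:
    """Drop the whitespace between an elided prefix and the next word."""
    return _ELISION_GAP.sub(r"\1\2", text)

if __name__ == "__main__":
    print(rejoin_elisions("l' empire, c' était"))
```

Text that is already joined, such as l'empire, is left unchanged because the pattern requires at least one whitespace character after the apostrophe.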

benob avatar Nov 19 '21 13:11 benob

Sure, thanks for the quick answer

marlon-br avatar Nov 19 '21 13:11 marlon-br