recasepunc Words with ' are split on tokenization step

Open marlon-br opened this issue 3 years ago • 2 comments

Hello, I have tested French model and in general it works great.

One issue for me is in the tokenization step. Words containing ' are split in two, so l'empire turns into l' and empire, and c'était turns into c' and était. Is that expected behavior, and what is a way to join such words back into one (other than just checking for ')?

Thanks!

Nov 19 '21 12:11 marlon-br

We lack a tokenizer that preserves offsets from the source text, which would let us insert punctuation without altering the text. Currently, a set of rules is applied for detokenization, and they don't remove the space after single quotes.

For now, you can apply your own rewriting rules as preprocessing. We hope to be able to do better in the future.
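
A minimal sketch of such a rewriting rule, assuming the text contains a space after the apostrophe as described above. The helper `rejoin_elisions` and the list of elided prefixes are hypothetical and not part of recasepunc; the prefix list is illustrative rather than exhaustive.

```python
import re

# Hypothetical helper, not part of recasepunc.
# Common French elided forms (l', d', j', c', m', n', s', t', qu', ...).
# This list is an assumption and is not exhaustive.
ELIDED = r"(?:[cdjlmnst]|qu|jusqu|lorsqu|puisqu|quoiqu)"

# Match an elided form followed by a straight or curly apostrophe,
# then whitespace, but only when a word character follows.
_PATTERN = re.compile(rf"\b({ELIDED}['’])\s+(?=\w)", re.IGNORECASE)


def rejoin_elisions(text: str) -> str:
    """Remove the space left after an elided form's apostrophe."""
    return _PATTERN.sub(r"\1", text)


if __name__ == "__main__":
    print(rejoin_elisions("l' empire"))
    print(rejoin_elisions("c' était"))
```

Restricting the rule to known elided prefixes, rather than joining on every apostrophe, avoids touching cases such as a closing single quote followed by another word.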

Nov 19 '21 13:11 benob

Sure, thanks for the quick answer

Nov 19 '21 13:11 marlon-br
