
Apostrophes in English

Open · j0hannes opened this issue 6 years ago · 12 comments

I just reported the same issue to the mosestokenizer package: https://github.com/luismsgomes/mosestokenizer/issues/1

The problem is that detokenization fails to handle apostrophes correctly:

import sacremoses                                                                                                                                     
tokens = 'yesterday ’s reception'.split(' ')                                                                                                          
print(sacremoses.MosesDetokenizer('en').detokenize(tokens))  

prints yesterday ’s reception, i.e. the space before the ’s is not removed.

j0hannes avatar Oct 30 '19 17:10 j0hannes

Seems to be the behavior of default moses too =(

$ echo "yesterday ’s reception" | perl tokenizer.perl -l en | perl detokenizer.perl 
Detokenizer Version $Revision: 4134 $
Language: en
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday ’ s reception


$ echo "yesterday 's reception" | perl tokenizer.perl -l en | perl detokenizer.perl 
Detokenizer Version $Revision: 4134 $
Language: en
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday 's reception

But that's because the 's is usually converted into the escaped &apos;s during tokenization, and the detokenization only recognizes the plain 's for the de-spacing.

$ echo "yesterday's reception" | perl tokenizer.perl -l en 
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday 's reception

And also, when using ’s instead of 's, the apostrophe didn't get escaped at all during tokenization, thus the detokenization didn't work.

$ echo "yesterday ’s reception" | perl tokenizer.perl -l en 
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday ’ s reception

In short, you should normalize the input, then detokenize it, before tokenizing it again and finally detokenizing.

That being said, it seems like the ’ is not mapping to the right apostrophe in Sacremoses =(

>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer

>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')

>>> text = "yesterday ’s reception"

>>> mpn.normalize(text)
'yesterday "s reception'

>>> mt.tokenize(mpn.normalize(text))
['yesterday', '"', 's', 'reception']

>>> md.detokenize(mt.tokenize(mpn.normalize(text)))
'yesterday "s reception'

Which also happens in Moses' perl script:

$ echo "yesterday ’s reception" | perl normalize-punctuation.perl 
yesterday "s reception

alvations avatar Nov 22 '19 12:11 alvations

The normalization bug in sacremoses happens here:

  • https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py#L41 and
  • https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py#L43

alvations avatar Nov 22 '19 12:11 alvations

Thanks @j0hannes for catching this, #78 should fix it but it should be rechecked with the Moses decoder repo too.
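
In the meantime, a rough stopgap on the user side (my own sketch, not part of the #78 change: just pre-replace the curly apostrophe yourself before normalizing) would be something like:

# Stopgap sketch: map the curly apostrophe (U+2019) to a plain ASCII
# apostrophe *before* normalizing, so the detokenizer sees the 's form
# it already handles.
from sacremoses import MosesPunctNormalizer, MosesDetokenizer

mpn = MosesPunctNormalizer()
md = MosesDetokenizer(lang='en')

text = "yesterday ’s reception"
pre_normalized = text.replace("\u2019", "'")   # ’ -> '
print(md.detokenize(mpn.normalize(pre_normalized).split()))
# should print: yesterday's reception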

alvations avatar Nov 22 '19 12:11 alvations

After the #78 fix, your cleaning workflow for your input would be something like:

  1. First normalize your input
  2. Then detokenize it (that's assuming you know that the original input is tokenized)

And if necessary:

  1. Then tokenize it
  2. Finally detokenize it again

>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer

>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')

>>> text = "yesterday ’s reception"
>>> md.detokenize(mt.tokenize(md.detokenize(mpn.normalize(text).split())))
"yesterday's reception"

alvations avatar Nov 22 '19 12:11 alvations

So, to get the detokenized version of my text, and not the detokenized version of the normalized text, I would need to perform double detokenization, find out which spaces got removed by the detokenizer, and remove those spaces from my original text. Otherwise, I would end up with something different from the original text.

j0hannes avatar Nov 22 '19 14:11 j0hannes

Yes, the example I gave is one of the typical pipelines that people use to clean the data for machine translation.

What's the expected output in your example? Do you want to detokenize or tokenize or normalize?

alvations avatar Nov 22 '19 14:11 alvations

I want to detokenize, without any changes to the text.

j0hannes avatar Nov 22 '19 14:11 j0hannes

Ah, do you mean something like:

>>> from sacremoses import MosesDetokenizer

>>> md = MosesDetokenizer(lang='en')

>>> text = "yesterday 's reception"
>>> md.detokenize(text.split())
"yesterday's reception"

But with the non-standard apostrophe:

>>> text = "yesterday ’s reception"
>>> md.detokenize(text.split())
'yesterday ’s reception'

alvations avatar Nov 25 '19 01:11 alvations

Actually, this part about adding a new apostrophe to the detokenization process isn't simple: https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L678

Because:

  • There's some smart quote counting happening
  • And the de-spacing of apostrophe might be language dependent
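
For illustration, the English de-spacing step in question is roughly this kind of rule (a simplified sketch of my own, not the actual sacremoses code, and it ignores the quote counting entirely):

import re

def naive_en_despace(tokens):
    # Glue English contraction tokens (e.g. 's, 'll, ’s) onto the previous
    # word. Sketch only: the real detokenizer also has to decide whether a
    # quote-like character is opening or closing a quotation, per language.
    out = []
    for tok in tokens:
        if out and re.match(r"['\u2019](s|m|d|ll|re|ve|t)$", tok, re.IGNORECASE):
            out[-1] += tok
        else:
            out.append(tok)
    return " ".join(out)

print(naive_en_despace("yesterday ’s reception".split()))
# yesterday’s reception

Adding ’ to the real rule means doing the same in the quote-pairing logic and for every language's contraction patterns, which is why it isn't a one-line change.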

I'd suggest bearing with the normalization of the apostrophe instead:

from sacremoses import MosesPunctNormalizer
from sacremoses import MosesTokenizer, MosesDetokenizer

mpn = MosesPunctNormalizer()
md = MosesDetokenizer(lang='en')

text = "yesterday ’s reception"
md.detokenize(mpn.normalize(text).split())

[out]:

yesterday's reception

alvations avatar Nov 25 '19 02:11 alvations

Sorry, this is not an option. I think I'll either try to embed the detokenizer, so that it returns an abstract representation of removed spaces that I can apply to the original text, or find some software out there that can do detokenization without touching the text. Are there many cases in which this function destroys the text (apart from apostrophes and probably also quotation marks)?

j0hannes avatar Nov 25 '19 07:11 j0hannes

Maybe try https://sjmielke.com/papers/tokenize/ or spacy for your use-case.

I can take a look at this again without changing the detokenization behavior, but no promises, because supporting non-normalized text opens a can of worms of supporting all other non-normalized forms =(

alvations avatar Nov 25 '19 08:11 alvations

I had seen that, but it's overly simplistic and language-agnostic; it can't possibly get the job done in all languages. An enhancement that would allow me to use the sacremoses detokenizer would be to have it return the length and offset of each part of the string that would be removed, instead of a string with those parts already removed.
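
Something along those lines could be approximated from the outside, e.g. (a rough sketch of my own, not a sacremoses API; it assumes the detokenizer only ever deletes whitespace and gives up otherwise):

from sacremoses import MosesDetokenizer

def removed_space_offsets(original, lang="en"):
    # Detokenize a copy of the text, then align it against the original to
    # find which whitespace spans disappeared. Returns (offset, length)
    # pairs relative to `original`.
    md = MosesDetokenizer(lang=lang)
    detok = md.detokenize(original.split())
    offsets = []
    i = j = 0
    while i < len(original):
        if j < len(detok) and original[i] == detok[j]:
            i += 1
            j += 1
        elif original[i].isspace():
            start = i
            while (i < len(original) and original[i].isspace()
                   and (j >= len(detok) or original[i] != detok[j])):
                i += 1
            offsets.append((start, i - start))
        else:
            # The detokenizer rewrote a character (escaping, quotes, ...),
            # so this naive alignment cannot be trusted for this input.
            raise ValueError("detokenizer changed more than whitespace")
    return offsets

print(removed_space_offsets("yesterday 's reception"))
# [(9, 1)] -- the space before 's was dropped
print(removed_space_offsets("yesterday ’s reception"))
# [] -- nothing was de-spaced, which is exactly the problem in this issue

The offsets could then be applied to the original, untouched text.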

j0hannes avatar Nov 25 '19 09:11 j0hannes