sacremoses
Apostrophes in English
I just reported the same issue to the mosestokenizer package: https://github.com/luismsgomes/mosestokenizer/issues/1
The problem is that detokenization fails to handle apostrophes correctly:
import sacremoses
tokens = 'yesterday ’s reception'.split(' ')
print(sacremoses.MosesDetokenizer('en').detokenize(tokens))
prints yesterday ’s reception
Seems to be the behavior of default Moses too =(
$ echo "yesterday ’s reception" | perl tokenizer.perl -l en | perl detokenizer.perl
Detokenizer Version $Revision: 4134 $
Language: en
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday ’ s reception
$ echo "yesterday 's reception" | perl tokenizer.perl -l en | perl detokenizer.perl
Detokenizer Version $Revision: 4134 $
Language: en
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday 's reception
But that's because a plain 's is usually split off into a separate 's token during tokenization, and the detokenizer only recognizes that 's token for the de-spacing.
$ echo "yesterday's reception" | perl tokenizer.perl -l en
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday 's reception
And when ’s is used instead of 's, the apostrophe doesn't get converted to the 's form, so the detokenization doesn't work.
$ echo "yesterday ’s reception" | perl tokenizer.perl -l en
Tokenizer Version 1.1
Language: en
Number of threads: 1
yesterday ’ s reception
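For sacremoses users, the same behaviour can be reproduced in Python. This is only a quick sketch; the outputs in the comments are what I'd expect, mirroring the perl runs above, and escape=False is passed so the apostrophe stays a literal character rather than an XML entity:
from sacremoses import MosesTokenizer, MosesDetokenizer
mt = MosesTokenizer(lang='en')
md = MosesDetokenizer(lang='en')
# Plain apostrophe: 's is split off as its own token,
# and the detokenizer knows how to re-attach it.
tokens = mt.tokenize("yesterday's reception", escape=False)
print(tokens)                 # expected: ['yesterday', "'s", 'reception']
print(md.detokenize(tokens))  # expected: yesterday's reception
# Curly apostrophe: the ’ is split into a lone token that the
# detokenizer does not recognize, so the spaces are left in place.
tokens = mt.tokenize("yesterday ’s reception", escape=False)
print(tokens)                 # expected: ['yesterday', '’', 's', 'reception']
print(md.detokenize(tokens))  # expected: yesterday ’ s reception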
In short, you should try to normalize the input and then detokenize it, before tokenizing it again and finally detokenizing it.
That being said, it seems like the ’ is not being mapped to the right apostrophe in Sacremoses =(
>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday ’s reception"
>>> mpn.normalize(text)
'yesterday "s reception'
>>> mt.tokenize(mpn.normalize(text))
['yesterday', '"', 's', 'reception']
>>> md.detokenize(mt.tokenize(mpn.normalize(text)))
'yesterday "s reception'
Which also happens in Moses' perl script:
$ echo "yesterday ’s reception" | perl normalize-punctuation.perl
yesterday "s reception
The normalization bug in sacremoses happens here:
- https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py#L41 and
- https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py#L43
Thanks @j0hannes for catching this, #78 should fix it but it should be rechecked with the Moses decoder repo too.
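For context, those normalizer rules are plain regex substitutions; as I understand it (my paraphrase, not the actual patch in #78), the essence of the fix is to map the right single quotation mark to an ASCII apostrophe instead of a double quote, roughly:
import re
# Simplified, illustrative substitution only (not the actual #78 patch):
# map U+2019 (right single quotation mark) to an ASCII apostrophe.
text = "yesterday ’s reception"
print(re.sub("’", "'", text))  # yesterday 's reception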
After the #78 fix, your cleaning workflow for your input would be something like:
- First normalize your input
- Then detokenize it (that's assuming you know that the original input is tokenized)
And if necessary:
- Then tokenize it
- Finally detokenize it again
>>> from sacremoses import MosesPunctNormalizer
>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mpn = MosesPunctNormalizer()
>>> mt = MosesTokenizer(lang='en')
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday ’s reception"
>>> md.detokenize(mt.tokenize(md.detokenize(mpn.normalize(text).split())))
"yesterday's reception"
So, to get the detokenized version of my text, and not the detokenized version of the normalized text, I would need to perform a double detokenization, find out which spaces got removed by the detokenizer, and remove those spaces from my original text. Otherwise, I would end up with something that is different from the original text.
Yes, the example I gave is one of the typical pipelines that people use to clean data for machine translation.
What's the expected output in your example? Do you want to detokenize, tokenize, or normalize?
I want to detokenize, without any changes to the text.
Ah, do you mean something like:
>>> from sacremoses import MosesDetokenizer
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday 's reception"
>>> md.detokenize(text.split())
"yesterday's reception"
But with the non-standard apostrophe:
>>> text = "yesterday ’s reception"
>>> md.detokenize(text.split())
'yesterday ’s reception'
Actually, this part about adding a new apostrophe to the detokenization process isn't simple: https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L678
Because:
- There's some smart quote counting happening
- And the de-spacing of apostrophe might be language dependent
I'd suggest bearing with the normalization of the apostrophe instead:
from sacremoses import MosesPunctNormalizer
from sacremoses import MosesTokenizer, MosesDetokenizer
mpn = MosesPunctNormalizer()
md = MosesDetokenizer(lang='en')
text = "yesterday ’s reception"
md.detokenize(mpn.normalize(text).split())
[out]:
yesterday's reception
Sorry, this is not an option. I think I'll either try to embed the detokenizer, so that it returns an abstract representation of removed spaces that I can apply to the original text, or maybe there's some software out there that can do detokenization without touching the text. Are there many cases in which this function destroys the text (apart from apostrophes and probably also quotation marks)?
Maybe try https://sjmielke.com/papers/tokenize/ or spacy for your use-case.
I can take a look at this again without changing the detokenization behavior, but no promises. Supporting non-normalized text opens a can of worms of having to support all other non-normalized forms =(
I had seen that, but it's overly simplistic and language-agnostic; it can't possibly get the job done in all languages. An enhancement that would allow me to use the sacremoses detokenizer would be to have it return the length and offset of each part of the string that would be removed, instead of a string with those parts already removed.
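For what it's worth, something close to that can be approximated on top of the current API by diffing the space-joined tokens against the detokenized string. This is only a rough sketch: it assumes the detokenizer does nothing but delete characters (which breaks as soon as unescaping or quote fix-ups change the text), and the removed_spans helper is hypothetical, not part of sacremoses:
import difflib
from sacremoses import MosesDetokenizer

def removed_spans(tokens, lang='en'):
    """Return (offset, length) pairs for substrings of ' '.join(tokens)
    that the detokenizer would delete. Sketch only: assumes detokenization
    only removes characters and leaves everything else unchanged."""
    joined = ' '.join(tokens)
    detok = MosesDetokenizer(lang=lang).detokenize(tokens)
    matcher = difflib.SequenceMatcher(None, joined, detok, autojunk=False)
    return [(i1, i2 - i1)
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op == 'delete']

print(removed_spans("yesterday 's reception".split()))
# [(9, 1)] -> the space before 's would be removed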