sacremoses
Strange behaviour of command-line detokenizer
I can't get the command-line detokenizer to work properly. I have tried this:
$ echo "L'amitié nous a fait forts d'esprit" | sacremoses tokenize -l fr | sacremoses detokenize -l fr
and I get
L & a p o s ; a m i t i é n o u s a f a i t f o r t s d & a p o s ; e s p r i t
What am I doing wrong? Cheers!
Sorry about that. I think it was caused by a mistake in a previous version, which was patched in #36.
Could you try the latest version (pip install -U sacremoses)? It should be version 0.0.13 now.
It should work now:
alvas@ubi:~$ echo "L'amitié nous a fait forts d'esprit" | sacremoses tokenize -l fr | sacremoses detokenize -l fr
[out]:
L' amitié nous a fait forts d' esprit
Seems like this https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L481 isn't used when iterating... I'll check that first thing tomorrow morning =)
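To illustrate the kind of bug suspected here (an assumption about the general failure mode, not the actual sacremoses code path): if a detokenizer receives a string where it expects a list of tokens and simply iterates over it, it sees single characters, which gives output shaped like the report above. The variable names are made up for the example.

```python
# Hypothetical illustration only; this is not the sacremoses implementation.
tokenized_line = "L&apos; amitié nous"

# Iterating over a str yields characters, so every character becomes a "token".
spaced_chars = " ".join(tokenized_line)
# Collapsing runs of whitespace gives the shape seen in the bug report.
symptom = " ".join(spaced_chars.split())

# Splitting on whitespace first gives the intended token sequence.
expected = " ".join(tokenized_line.split())

print(symptom)
print(expected)
```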
Hmmm, it seems the apostrophe for French isn't working as expected though:
From original moses:
$ echo "L'amitié nous a fait forts d'esprit" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr | ~/mosesdecoder/scripts/tokenizer/detokenizer.perl -l fr
Detokenizer Version $Revision: 4134 $
Language: fr
Tokenizer Version 1.1
Language: fr
Number of threads: 1
L'amitié nous a fait forts d'esprit
Wow, that was fast. Yes, apostrophes don't look good when detokenized (they are separated with spaces).
I'll be grateful if you let me know of any progress.
(1) Have you had a chance to solve the problem with spaces when detokenizing, @alvations ?
(2) Also, apparently, there is a way to specify the language when creating the tokenizer.
from sacremoses import MosesTokenizer
mt=MosesTokenizer(lang="fr")
It would be nice to document this in the README.md. By the way, when the language is not "en", "it" or "fr", but you specify it, the apostrophes are doubly escaped with backslashes for a reason that escapes me:
mt=MosesTokenizer(lang="es")
print(mt.tokenize("Un texto con 'comillas' para probar"))
produces
['Un', 'texto', 'con', '\\&apos;', 'comillas', '\\&apos;', 'para', 'probar']
where it would be more appropriate to have (as with lang="en")
['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']
By the way, I could consider offering my help with Catalan ("ca") tokenization. The current French and Italian model partly works, but Catalan has post-verbal pronouns such as
Informa-te'n
which should be tokenized as
Informa -te 'n
But first I'd have to get better acquainted with your code.
Cheers, Mikel
@mlforcada Sorry for the delay!
The latest version should now have the French apostrophes patched.
from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
mt = MosesTokenizer(lang='fr')
md = MosesDetokenizer(lang='fr')
md.detokenize(mt.tokenize("L'amitié nous a fait forts d'esprit")) == "L'amitié nous a fait forts d'esprit"
The bug was on my side: in the check on the token that follows an apostrophe clitic, I anchored the end of the string, re.search(u'^[{}]$'.format(self.IsAlpha), tokens[i + 1]), so only single-letter tokens matched. The original Moses detokenizer has no such anchor: $words[$i+1] =~ /^[\p{IsAlpha}]/
Regarding the Spanish escaping of the ampersand, I'm not able to reproduce it; it shouldn't be a problem with version >=0.0.13. The latest French patch is in >=0.0.14.
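A small sketch of the difference between the two checks, using only Python's re module. The character class below is a stand-in I am assuming for Moses' IsAlpha (roughly "any Unicode letter"); sacremoses builds its own class, so treat this as an approximation.

```python
import re

# Assumed stand-in for IsAlpha: word characters minus digits and underscore.
ALPHA = r"[^\W\d_]"

# Check with the end-of-string anchor: the whole token must be a single letter.
anchored = re.compile(r"^{}$".format(ALPHA))
# Check as in the Perl detokenizer: the token only has to start with a letter.
prefix_only = re.compile(r"^{}".format(ALPHA))

next_token = "amitié"
anchored_hit = bool(anchored.search(next_token))
prefix_hit = bool(prefix_only.search(next_token))

print(anchored_hit, prefix_hit)
```

With the anchored pattern the token after L' is never recognised as a word, so the apostrophe clitic is not re-attached and the space stays.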
Which version of sacremoses are you using?
>>> import sacremoses
>>> sacremoses.__version__
0.0.19
>>> from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
>>> mt=MosesTokenizer(lang="es")
>>> print(mt.tokenize("Un texto con 'comillas' para probar"))
['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']
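A note for reading the token lists in this thread: the tokenizer escapes apostrophes as the entity &apos; (visible, character by character, in the detokenizer output quoted in the first post). A stdlib-only sketch of mapping such tokens back to plain characters, independent of sacremoses; the token list is typed in by hand for the example:

```python
import html

# Token list as returned with escaping on (apostrophes become &apos;).
escaped_tokens = ["Un", "texto", "con", "&apos;", "comillas", "&apos;", "para", "probar"]

# html.unescape resolves the named entity and leaves ordinary tokens untouched.
plain_tokens = [html.unescape(tok) for tok in escaped_tokens]

print(plain_tokens)
```

As far as I know, tokenize() also accepts an escape argument that turns the escaping off at the source, but check the signature in your installed version.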
Also, having Catalan-specific rules would be awesome! I have a vested interest in Catalan text processing =)
Do you have a list of rules and words that should prevent weird splitting for Catalan?
Thanks a million, @alvations! I updated. Sacremoses now says it is 0.0.19. Detokenization for French works like a breeze now! Cheers!
Catalan rules for apostrophes and hyphens with pronouns, articles and prepositions:
Works as in French and Italian:
[dlmnts]'WORD → [dlmnts]' WORD
Single pronoun after verb, apostrophe.
VERB'[lmnst] → VERB '[lmnst]
VERB'ns → VERB 'ns
VERB'ls → VERB 'ls
Single pronoun, after verb, with hyphen:
VERB-(me|te|lo|la|li|nos|vos|us|los|les|se|ne|hi) → VERB -(me|te|lo|la|li|nos|vos|us|los|les|se|ne|hi)
Two pronouns, after verb, two hyphens (bit overgenerated, but should work)
VERB-(me|te|se|lo|la|li|nos|us|vos|los|les)-(em|el|la|li|en|ens|us|els|les|hi|ho) → VERB -(me|te|se|lo|la|li|nos|us|vos|los|les) -(em|el|la|li|en|ens|us|els|les|hi|ho)
Two pronouns, apostrophe and hyphenated
VERB'(ns|ls)-(el|la|els|les|li|ho|hi|en) → VERB '(ns|ls) -(el|la|els|les|li|ho|hi|en)
Two pronouns, hyphenated and apostrophe
VERB'(me|te|se|li|)-(m|t|s|l|ns|ls) → VERB '(me|te|se|li) -(m|t|s|l|ns|ls)
In this last case, probably the second part could be processed with the single apostrophe rule above.
Thanks again,
@mlforcada
Aggh, the last one is wrong. It should be
VERB-(me|te|se|li)'(m|t|s|l|ns|ls) → VERB -(me|te|se|li) '(m|t|s|l|ns|ls)
Sorry about that!
Thanks @mlforcada! Let me see how I could convert the rules above =)
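As a starting point, here is a rough sketch of how the rules above (with the corrected last rule) could be written as Python regexes. It is not sacremoses code: the names split_enclitics and split_clitics are made up for illustration, it works on one whitespace-delimited word at a time, it only handles the ASCII apostrophe, and it cannot tell verbs from other words, so it overgenerates. Following the remark about the last rule, the apostrophe pronoun after a hyphenated one is matched with the single-apostrophe set, which also covers 'n.

```python
import re

# Pronoun sets copied from the rules in this thread.
HYPHEN_SINGLE = "me|te|lo|la|li|nos|vos|us|los|les|se|ne|hi"
HYPHEN_FIRST = "me|te|se|lo|la|li|nos|us|vos|los|les"
HYPHEN_SECOND = "em|el|la|li|en|ens|us|els|les|hi|ho"
AFTER_APOS = "el|la|els|les|li|ho|hi|en"
BEFORE_APOS = "me|te|se|li"
APOS_SINGLE = "ns|ls|[lmnst]"

# Two-pronoun rules come first so they win over the single-pronoun ones.
ENCLITIC_RULES = [
    re.compile(rf"(.+?)(-(?:{HYPHEN_FIRST}))(-(?:{HYPHEN_SECOND}))"),
    re.compile(rf"(.+?)('(?:ns|ls))(-(?:{AFTER_APOS}))"),
    re.compile(rf"(.+?)(-(?:{BEFORE_APOS}))('(?:{APOS_SINGLE}))"),
    re.compile(rf"(.+?)(-(?:{HYPHEN_SINGLE}))"),
    re.compile(rf"(.+?)('(?:{APOS_SINGLE}))"),
]

# Elided article/preposition/pronoun before a word, as in French and Italian.
PROCLITIC = re.compile(r"([dlmnts]')(.+)", re.IGNORECASE)


def split_enclitics(word):
    """Split post-verbal pronouns off the end of a word, if a rule matches."""
    for rule in ENCLITIC_RULES:
        match = rule.fullmatch(word)
        if match:
            return list(match.groups())
    return [word]


def split_clitics(word):
    """Split a leading elided form first, then any post-verbal pronouns."""
    match = PROCLITIC.fullmatch(word)
    if match:
        return [match.group(1)] + split_enclitics(match.group(2))
    return split_enclitics(word)


print(" ".join(split_clitics("Informa-te'n")))
```

Trying it on the example from earlier in the thread gives the three expected pieces; ordinary words without clitics come back unchanged.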