
Strange behaviour of command-line detokenizer

Open mlforcada opened this issue 6 years ago • 11 comments

mlforcada avatar Apr 01 '19 13:04 mlforcada

I can't get the command-line detokenizer to work properly. I have tried this:

$ echo "L'amitié nous a fait forts d'esprit" | sacremoses tokenize -l fr | sacremoses detokenize -l fr

and I get

L & a p o s ; a m i t i é n o u s a f a i t f o r t s d & a p o s ; e s p r i t

What am I doing wrong? Cheers!

mlforcada avatar Apr 01 '19 13:04 mlforcada

Sorry about that. I think it was caused by a mistake in a previous version, which was patched in #36.

Could you try the latest version with pip install -U sacremoses? It should be version 0.0.13 now.

It should work now:

alvas@ubi:~$ echo "L'amitié nous a fait forts d'esprit" | sacremoses tokenize -l fr | sacremoses detokenize -l fr

[out]:

L' amitié nous a fait forts d' esprit

Seems like this https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L481 isn't used when iterating... I'll check that first thing tomorrow morning =)

alvations avatar Apr 01 '19 14:04 alvations

Hmmm, it seems the apostrophe handling for French isn't working as expected, though:

From original moses:

$ echo "L'amitié nous a fait forts d'esprit" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr | ~/mosesdecoder/scripts/tokenizer/detokenizer.perl -l fr
Detokenizer Version $Revision: 4134 $
Language: fr
Tokenizer Version 1.1
Language: fr
Number of threads: 1
L'amitié nous a fait forts d'esprit

alvations avatar Apr 01 '19 14:04 alvations

Wow, that was fast. Yes, apostrophes don't look good when detokenized (they are separated with spaces).

mlforcada avatar Apr 01 '19 14:04 mlforcada

I'll be grateful if you let me know of any progress.

mlforcada avatar Apr 01 '19 20:04 mlforcada

(1) Have you had a chance to solve the problem with spaces when detokenizing, @alvations ?

(2) Also, apparently, there is a way to specify the language when creating the tokenizer.

from sacremoses import MosesTokenizer
mt=MosesTokenizer(lang="fr")

It would be nice to document this in the README.md. By the way, when you specify a language other than "en", "it" or "fr", the apostrophes are doubly escaped with backslashes, for a reason that escapes me:

mt=MosesTokenizer(lang="es")
print(mt.tokenize("Un texto con 'comillas' para probar"))

produces

['Un', 'texto', 'con', '\\&apos;', 'comillas', '\\&apos;', 'para', 'probar']

where it would be more appropriate to have (as with lang="en")

['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']
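
For readers who only want plain apostrophes back: Moses-style tokenizers escape XML special characters by default, so an apostrophe comes out as the entity &apos;. A minimal standard-library sketch for undoing that on an already tokenized list follows; the token list is copied from the example in this comment and the variable names are illustrative only. As far as I know, tokenize() also accepts escape=False to skip the escaping in the first place, but treat that as an assumption and check it against your installed version.

```python
import html

# Tokens as a Moses-style tokenizer would return them with escaping enabled
# (sample data taken from the example in this thread).
tokens = ['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']

# html.unescape maps XML/HTML entities such as &apos; back to characters.
plain_tokens = [html.unescape(token) for token in tokens]

print(plain_tokens)
```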

By the way, I could consider offering my help with Catalan ("ca") tokenization. The current French and Italian model partly works, but Catalan has post-verbal pronouns such as

Informa-te'n

which should be tokenized as

Informa -te 'n

But first I'd have to get better acquainted with your code.

Cheers, Mikel

mlforcada avatar Apr 10 '19 16:04 mlforcada

@mlforcada Sorry for the delay!

Now the latest version should have the french apostrophes patched.

from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
mt = MosesTokenizer(lang='fr')
md = MosesDetokenizer(lang='fr')
md.detokenize(mt.tokenize("L'amitié nous a fait forts d'esprit")) == "L'amitié nous a fait forts d'esprit"

The bug was that my check on the token following the apostrophe clitic also anchored the end of the string: re.search(u'^[{}]$'.format(self.IsAlpha), tokens[i + 1]). That only matches a next token that is a single alphabetic character. The original Moses detokenizer has no end anchor and only tests the first character: $words[$i+1] =~ /^[\p{IsAlpha}]/
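
The effect of the stray end anchor can be seen in isolation with a small sketch. It does not use the sacremoses code itself: the character class below is a stand-in for self.IsAlpha (an assumption made so the snippet runs with the standard library only), and the variable names are illustrative.

```python
import re

# Stand-in for sacremoses' IsAlpha character set (assumption for this sketch):
# any Unicode word character that is neither a digit nor an underscore.
ALPHA = r"[^\W\d_]"

next_token = "amitié"  # the token that follows the clitic "L'"

# Check with the end anchor: the whole token must be ONE alphabetic character.
anchored = re.search(r"^{}$".format(ALPHA), next_token)

# Moses-style check: only the FIRST character has to be alphabetic.
unanchored = re.search(r"^{}".format(ALPHA), next_token)

print(anchored)    # no match for a multi-letter token
print(unanchored)  # matches the leading "a"
```

With the anchored pattern the detokenizer never sees "amitié" as a word following a clitic, so it leaves the space after "L'" in place.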

alvations avatar Apr 12 '19 01:04 alvations

Regarding the Spanish escaping of the ampersand, I'm not able to reproduce it; it shouldn't be a problem with version >=0.0.13. The latest French patch is in >=0.0.14.

Which version of sacremoses are you using?

>>> import sacremoses
>>> sacremoses.__version__
0.0.19

>>> from sacremoses.tokenize import MosesTokenizer, MosesDetokenizer
>>> mt=MosesTokenizer(lang="es")
>>> print(mt.tokenize("Un texto con 'comillas' para probar"))

['Un', 'texto', 'con', '&apos;', 'comillas', '&apos;', 'para', 'probar']

Also, having Catalan-specific rules would be awesome! I have a vested interest in Catalan text processing =)

Do you have a list of rules and words that should prevent weird splitting for Catalan?

alvations avatar Apr 12 '19 01:04 alvations

Thanks a million, @alvations! I updated, and sacremoses now reports version 0.0.19. Detokenization for French works like a breeze now! Cheers!

Catalan rules for apostrophes and hyphens with pronouns, articles and prepositions:

Works as in French and Italian:

[dlmnts]'WORD → [dlmnts]' WORD

Single pronoun after verb, apostrophe.

VERB'[lmnst] → VERB '[lmnst]
VERB'ns → VERB 'ns
VERB'ls → VERB 'ls

Single pronoun, after verb, with hyphen:

VERB-(me|te|lo|la|li|nos|vos|us|los|les|se|ne|hi) → VERB -(me|te|lo|la|li|nos|vos|us|los|les|se|ne|hi)

Two pronouns, after verb, two hyphens (bit overgenerated, but should work)

VERB-(me|te|se|lo|la|li|nos|us|vos|los|les)-(em|el|la|li|en|ens|us|els|les|hi|ho) → VERB -(me|te|se|lo|la|li|nos|us|vos|los|les) -(em|el|la|li|en|ens|us|els|les|hi|ho)

Two pronouns, apostrophe and hyphenated

VERB'(ns|ls)-(el|la|els|les|li|ho|hi|en) → VERB '(ns|ls) -(el|la|els|les|li|ho|hi|en)

Two pronouns, hyphenated and apostrophe

VERB'(me|te|se|li|)-(m|t|s|l|ns|ls) → VERB '(me|te|se|li) -(m|t|s|l|ns|ls)

In this last case, probably the second part could be processed with the single apostrophe rule above.

Thanks again,

@mlforcada 

mlforcada avatar Apr 12 '19 09:04 mlforcada

Aggh, the last one is wrong. It should be

VERB-(me|te|se|li)'(m|t|s|l|ns|ls) → VERB -(me|te|se|li) '(m|t|s|l|ns|ls)

Sorry about that!
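
As a starting point for converting the rules above, here is a rough sketch in plain Python regular expressions. It is not sacremoses code: the function name split_catalan_clitics and the constant names are hypothetical, it works on one whitespace-delimited word at a time, and it merges all the hyphenated pronoun lists into one, so it overgenerates even more than the rules as written.

```python
import re

# Union of the hyphenated pronoun forms listed in the rules above.
HYPHEN_PRONOUNS = "me|te|se|lo|la|li|nos|vos|us|los|les|ne|hi|ho|em|el|en|ens|els"

# [dlmnts]'WORD -> [dlmnts]' WORD
PROCLITIC = re.compile(r"^([dlmnts]')(.+)$", re.IGNORECASE)

# One trailing pronoun: either -PRONOUN or '(ns|ls|l|m|n|s|t).
ENCLITIC = re.compile(r"^(.+?)(-(?:%s)|'(?:ns|ls|[lmnst]))$" % HYPHEN_PRONOUNS)


def split_catalan_clitics(word):
    """Split one word into [proclitic] + stem + [enclitics] (hypothetical helper)."""
    head = []
    match = PROCLITIC.match(word)
    if match:
        head.append(match.group(1))
        word = match.group(2)

    # Peel pronouns off the end one at a time, so combinations such as
    # -te'n, 'ns-el or -me-la need no rule of their own.
    tail = []
    while True:
        match = ENCLITIC.match(word)
        if not match:
            break
        tail.insert(0, match.group(2))
        word = match.group(1)

    return head + [word] + tail


print(" ".join(split_catalan_clitics("Informa-te'n")))  # Informa -te 'n
```

Peeling one pronoun at a time means the two-pronoun rules fall out of the single-pronoun ones, which is the simplification suggested for the last rule. The price is that any word ending in a hyphen plus one of these forms gets split, whether or not the stem is a verb.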

mlforcada avatar Apr 12 '19 09:04 mlforcada

Thanks @mlforcada! Let me see how I could convert the rules above =)

alvations avatar Apr 13 '19 03:04 alvations