Tokenizer Internationalization - Spanish

Open clusterfudge opened this issue 9 years ago • 11 comments

We should test whether the EnglishTokenizer implementation is sufficient for Spanish and, if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.

clusterfudge avatar Jan 08 '16 17:01 clusterfudge

What is needed in order to test it? I am not familiar with adapt's design, and I am reading the README.md right now. Should I translate the strings, or is something else needed?

ghost avatar Jan 08 '16 20:01 ghost

First, you'll need to validate whether or not the EnglishTokenizer is sufficient. I would do this by creating Spanish versions of the examples and experimenting with them. Specifically, the tokenizer is punctuation-aware and splits an utterance (a sentence or phrase) into individual tokens (usually words).
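As a rough illustration of the kind of check described here, the sketch below tokenizes a few Spanish utterances. The `tokenize` function is a hypothetical regex stand-in, not adapt's EnglishTokenizer; it only shows what to look for, namely that accented characters stay inside their words and that inverted punctuation such as `¿` becomes its own token.

```python
import re
import unicodedata

# Hypothetical stand-in for a punctuation-aware tokenizer. This is NOT
# adapt's EnglishTokenizer; it only illustrates the behaviour to verify.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(utterance):
    """Split an utterance into word tokens and single punctuation tokens."""
    # Normalize to NFC so that an accented letter is one code point and is
    # matched by \w instead of being split from its combining accent.
    normalized = unicodedata.normalize("NFC", utterance)
    return TOKEN_PATTERN.findall(normalized)


for sample in ("pon algo de música de los clash",
               "¿qué tiempo hace en seattle?"):
    print(tokenize(sample))
```

If the real tokenizer splits `música` at the accent or glues `¿` onto `qué`, that would be a sign that Spanish needs its own handling.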

If the English tokenizer does not work well, you'll need to look for an equivalent of the Porter stemmer algorithm for Spanish and implement it. The latter can be picked up by someone else if that's beyond your scope. Validating whether or not the existing tokenizer is sufficient is a great first step.
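To see why English-oriented suffix rules may not carry over, here is a toy suffix-stripping sketch. It is far simpler than the Porter or Snowball algorithms, and both suffix lists are illustrative assumptions rather than anything taken from adapt or from a real stemmer.

```python
# Illustrative suffix lists only. Real stemmers (Porter, Snowball) apply
# ordered rule sets with conditions on the remaining stem.
ENGLISH_SUFFIXES = ("ing", "ed", "es", "s")
SPANISH_SUFFIXES = ("iendo", "ando", "amos", "ar", "er", "ir",
                    "as", "es", "os", "a", "o", "s")


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Three forms of the Spanish verb "escuchar" (to listen).
forms = ["escuchar", "escuchando", "escuchamos"]
english_stems = {strip_suffix(w, ENGLISH_SUFFIXES) for w in forms}
spanish_stems = {strip_suffix(w, SPANISH_SUFFIXES) for w in forms}
print(english_stems)
print(spanish_stems)
```

With the English-style list the three forms end up with three different stems, while the Spanish-style list maps all of them to the same stem, which is the property an intent parser would rely on when matching vocabulary.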

Thanks!

seanfitzgeraldsc avatar Jan 08 '16 20:01 seanfitzgeraldsc

I see. I am willing to do this, though I can't at this very moment; I will do some experiments later. Expect many questions, because it's very likely I will get lost!

cheers!

ghost avatar Jan 08 '16 21:01 ghost

Hi, if you need help reimplementing the Porter stemmer algorithm for Spanish or other languages, take a look at https://github.com/OleanderSoftware/OleanderStemmingLibrary. It's a very good library.

mcicolella avatar Jan 09 '16 10:01 mcicolella

I do not know if I did what I was supposed to do, but I've just modified the source code of multi_intent_parser.py to "understand" Spanish words: http://pastebin.com/bEJqCKuj

You can try these sentences: "pon algo de música de los clash", "quiero escuchar algo de música de los clash", "qué tiempo hace en seattle". It seems to return JSON.

Is that what's needed?

adocampo avatar Feb 28 '16 18:02 adocampo

So, this is definitely some helpful work! I think we'd want to have samples per language, maybe separated into folders. To really verify that this works for Spanish, we'd need the unit tests translated to Spanish, and even better, localization work done on the unit tests so that the language stays the same, but they load different data files for different languages. That would give me high confidence that the language itself works with the tokenizer, but that may be an unrealistic goal. Can you try translating some of the engine tests?
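One possible shape for such localized tests, sketched under assumptions: the `TEST_DATA` dictionary stands in for per-language data files, and `match_keyword` is a hypothetical function under test, not part of adapt. The test body is written once and runs against every language's data.

```python
import unittest

# Stand-in for per-language data files (for example, one file per language
# folder). The keys and layout are assumptions, not adapt's real fixtures.
TEST_DATA = {
    "en": {
        "vocabulary": ["weather"],
        "utterance": "what is the weather in seattle",
        "expected": "weather",
    },
    "es": {
        "vocabulary": ["tiempo"],
        "utterance": "qué tiempo hace en seattle",
        "expected": "tiempo",
    },
}


def match_keyword(utterance, vocabulary):
    """Hypothetical function under test: first vocabulary word in the utterance."""
    tokens = utterance.lower().split()
    for word in vocabulary:
        if word in tokens:
            return word
    return None


class LocalizedKeywordTest(unittest.TestCase):
    def test_keyword_found_in_each_language(self):
        # Same assertions for every language; only the data differs.
        for lang, case in TEST_DATA.items():
            with self.subTest(lang=lang):
                found = match_keyword(case["utterance"], case["vocabulary"])
                self.assertEqual(found, case["expected"])
```

Using `subTest` means a failure is reported with the language code attached, so a problem specific to the Spanish data is easy to tell apart from a problem in the shared test logic.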

clusterfudge avatar Feb 29 '16 18:02 clusterfudge

Thanks for contributing!

clusterfudge avatar Feb 29 '16 18:02 clusterfudge

> Can you try translating some of the engine tests?

Of course I can. Could you please point me to the engines? I only saw this one, https://github.com/MycroftAI/adapt/blob/master/test/IntentEngineTest.py, and I doubt I can do anything with it...

adocampo avatar Mar 03 '16 08:03 adocampo

That would be the test I was referencing. Swapping out the vocabulary/utterances for Spanish equivalents would be acceptable to me, but completely unverifiable (as I only took about 2 years of Spanish, 20 years ago).

clusterfudge avatar Mar 03 '16 08:03 clusterfudge

OK, I only translated the utterance sentence (line 36) and the two expressions "tree" (line 34) and "house" (line 43): http://pastebin.com/PkZJ4Gmq

I don't know if this is what you need, and perhaps the utterance sentence could be translated into Spanish differently depending on whether it is imperative (as I've translated it), infinitive, or another tense...

Hope it helps!

adocampo avatar Mar 03 '16 09:03 adocampo

Should I open a new issue for Portuguese?

drawveloper avatar May 21 '16 10:05 drawveloper