use Morfologik for Portuguese?

Open danielnaber opened this issue 3 years ago • 9 comments

Is there a reason not to use Morfologik for Portuguese? It's faster than hunspell, and most languages in LT use it (except those that have compounds, I think Portuguese doesn't?).

@jaumeortola Do you have an opinion on this?

danielnaber avatar Nov 25 '21 12:11 danielnaber

Certainly, we'll be much better off with Morfologik. The only problem that comes to mind is language varieties: we have to figure out how many spelling dictionaries we really need and what the differences are.

jaumeortola avatar Nov 25 '21 13:11 jaumeortola

I think we might need two spelling dictionaries for pt-PT and pt-BR.

Some words might not be in use in pt-PT, like xícara (= chávena), others take different accents (metro vs. metrô, ténis vs. tênis). One tagging problem might be verbs ending in -ar in the pretérito perfeito simples, which end in -amos in pt-BR (-ámos in pt-PT), and are homographs of the present tense in pt-BR.
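One rough way to see how far the two variants diverge is to compare their word lists as sets. The sketch below is only an illustration: `compare_variants` is a hypothetical helper, and the toy lists reuse the examples above rather than real dictionary data.

```python
def compare_variants(pt_pt_words, pt_br_words):
    """Split two word lists into shared and variant-specific sets."""
    pt_pt, pt_br = set(pt_pt_words), set(pt_br_words)
    return {
        "shared": pt_pt & pt_br,
        "pt_PT_only": pt_pt - pt_br,
        "pt_BR_only": pt_br - pt_pt,
    }


# Toy lists built from the examples above; NOT real dictionary contents.
toy_pt_pt = ["casa", "chávena", "ténis"]
toy_pt_br = ["casa", "xícara", "tênis"]

for group, words in compare_variants(toy_pt_pt, toy_pt_br).items():
    print(group, sorted(words))
```

If the variant-specific sets turn out to be small relative to the shared set, that would be an argument for a common base dictionary plus per-variant additions. Homographs with different tags (such as the -amos forms) would not show up in a plain set comparison like this.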

udomai avatar Nov 25 '21 15:11 udomai

Maybe we should just use the same dicts we have now and export them: https://github.com/languagetool-org/languagetool/tree/9d3c36600f369cba03105343b4f0550a016e6cdf/languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/hunspell

danielnaber avatar Nov 25 '21 15:11 danielnaber

Related: https://github.com/languagetool-org/languagetool/issues/2082 https://github.com/languagetool-org/languagetool/issues/199

tiff avatar Nov 25 '21 15:11 tiff

Number of lines (words) in the different Hunspell dictionaries:

12,558,170 pt_PT (~5 million with enclitics: -nos, -vos, -te, -se...)
10,545,031 pt_BR (~7 million with enclitics: -te, -lhe, -lhes...)
10,914,777 pt_AO
10,004,463 pt_MZ

Lines in the tagger dictionary: 1,131,147
added.txt: 7067

The number and the distribution of enclitics are surprisingly different in PT and BR. Most common enclitics, PT vs. BR (count | enclitic, with pt_PT in the left pair of columns and pt_BR in the right pair):

501396 | -nos | 532194 | -te
99440 | -vos | 531876 | -lhe
99440 | -te | 531857 | -lhes
99440 | -se | 527201 | -vos
99440 | -me | 494055 | -nos
99440 | -lhos | 424264 | -me
99440 | -lho | 379598 | -se
99440 | -lhes | 331421 | -se-lhe
99440 | -lhe | 331407 | -se-lhes
99440 | -lhas | 254694 | -a
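Counts like these could be reproduced from an expanded word list (one form per line) with a short script. The following is a minimal sketch, not the method actually used here: `count_enclitics` and the `CLITICS` set are assumptions for illustration, and the pronoun list is deliberately incomplete.

```python
from collections import Counter

# Illustrative, incomplete set of clitic pronouns (an assumption, not LT data).
CLITICS = {"me", "te", "se", "nos", "vos", "lhe", "lhes",
           "lho", "lhos", "lha", "lhas", "o", "a", "os", "as"}


def count_enclitics(words):
    """Count hyphenated pronoun tails, e.g. 'dá-se-lhe' is counted as '-se-lhe'."""
    counts = Counter()
    for word in words:
        head, sep, tail = word.strip().partition("-")
        # Only count tails made up entirely of known clitics, so that
        # ordinary hyphenated compounds such as 'guarda-chuva' are skipped.
        if sep and head and tail and all(p in CLITICS for p in tail.split("-")):
            counts["-" + tail] += 1
    return counts


sample = ["dá-me", "diz-me", "dá-se-lhe", "fala-se", "guarda-chuva", "casa"]
for enclitic, n in count_enclitics(sample).most_common():
    print(n, "|", enclitic)
```

Running the same function over the pt_PT and pt_BR lists and comparing the two `Counter` objects would give a table in the format above.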

In pt_PT there are millions of forms with prefixes (and suffixes) that don't make much sense. See:

acometo
antiacometo
reacometo
biacometo
triacometo
tetraacometo
pentaacometo
hexaacometo
cometo
anticometo
recometo
bicometo
tricometo
tetracometo
pentacometo
hexacometo

For example, one and a half million words (the whole dictionary?) with tetra-:

tetraxenotransplantes
tetraxenotransplantíssimo
tetraxenotransplantíssima
tetraxenotransplantíssimos
tetraxenotransplantíssimas
tetraxenotransplantice
tetraxenotransplantices
tetraxenotransplante

Probably, most of these features (prefixes, suffixes, enclitics...) are not being used currently in LT.
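As a starting point for a cleanup, forms like the ones listed above could be flagged automatically: a word is a candidate if it is a known prefix followed by another word that is already in the list. This is a hedged sketch only. `prefixed_derivatives` is a hypothetical helper, the prefix tuple is taken from the examples above, and legitimate words (for instance real re- verbs) would be flagged too, so the output would need manual review.

```python
# Prefixes seen in the examples above; a real cleanup would likely need more.
PREFIXES = ("anti", "re", "bi", "tri", "tetra", "penta", "hexa")


def prefixed_derivatives(words, prefixes=PREFIXES):
    """Return {word: (prefix, base)} for words that are prefix + another listed word."""
    vocab = set(words)
    flagged = {}
    for word in vocab:
        for prefix in prefixes:
            base = word[len(prefix):]
            if word.startswith(prefix) and base in vocab:
                flagged[word] = (prefix, base)
                break  # the first matching prefix is enough to flag the word
    return flagged


sample = ["acometo", "antiacometo", "tetraacometo", "cometo", "tricometo"]
for word, (prefix, base) in sorted(prefixed_derivatives(sample).items()):
    print(f"{word} = {prefix}- + {base}")
```

The function only reports candidates and removes nothing, so the decision about which forms are nonsense stays with a human reviewer.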

jaumeortola avatar Feb 11 '22 17:02 jaumeortola

That looks interesting! Something is probably wrong with the enclitic counts for pt-PT vs. pt-BR. The numbers should be higher in pt-PT, since in pt-BR the postponed object pronoun is far less frequent in all but the highest registers.

udomai avatar Feb 14 '22 16:02 udomai

I have been looking into some problems in the Hunspell Portuguese dictionaries: https://github.com/languagetool-org/languagetool/issues/6298

BTW, in today's and yesterday's nightly diffs there are some unexpected changes: many spelling suggestions have changed, and it's not clear why. There are some changes in German as well, but not so many. Any idea about this, @danielnaber?

https://internal1.languagetool.org/regression-tests/via-http/2022-09-01/pt-BR/result_java_HUNSPELL_RULE.html
https://internal1.languagetool.org/regression-tests/via-http/2022-09-02/pt-BR/result_java_HUNSPELL_RULE.html

Instead of trying to solve these problems, we should convert the spelling dictionaries to Morfologik. Possible obstacles:

  • The changes in tokenization can produce undesired rule matches. We need to check that thoroughly. Perhaps we'll need additional grammar rules for blind spots.
  • There are many nonsense words in the dictionary (with prefixes, as mentioned in previous messages). This should be cleaned up, but it is not essential. We can live with it.

I would need several days (a whole week?) to do it, with the support of @susanaboatto. I can start a branch and see if it is doable in a reasonable amount of time.

jaumeortola avatar Sep 02 '22 09:09 jaumeortola

The changes in the German speller in today's diff are simply because words have been added to spelling.txt, I think.

danielnaber avatar Sep 02 '22 10:09 danielnaber

I wonder if that's also the reason we have changes in the PT speller. I have been editing spelling.txt thoroughly this week.

susanaboatto avatar Sep 02 '22 10:09 susanaboatto