languagetool
[pt] Improving verb-related disambiguation - 2022-07-10
Hello @jaumeortola
I have come up with an idea for a very useful and powerful piece of POS information for the disambiguator.
Look at this:
<token postag='VMIP3S0|VMM02S0' postag_regexp='yes'/>
<token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
<token regexp='yes' spacebefore='no'>[ao]s?|l[ao]s?|lh[aeo]s?|n[ao]s?|me|se|te|vos</token>
Could I ask you, when you have the time, to provide a single POS tag covering these two tokens:
<token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
<token regexp='yes' spacebefore='no'>[ao]s?|l[ao]s?|lh[aeo]s?|n[ao]s?|me|se|te|vos</token>
This would avoid having to create two rules for every verb pattern of the form "verb + hyphen + lhe/te/etc.".
I am still not sure about the name for the POS tag, but it would need to refer to the hyphen plus the attached pronoun. For example:
"Quero-lhes dizer."
Would generate:
VMIP1S0
_HYPHEN:PP3CPD00
<plus other POSes in the sentence>
I don't fully understand what you are asking for. One POS tag for two tokens is not possible.
What we do in other languages (French, Catalan) is to tokenize this way: <token>Quero</token><token>-lhes</token>. This could be done, but it requires changing the tokenizer and adjusting all rules for these patterns.
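To make the proposed tokenization concrete, here is a minimal Python sketch of the splitting step only. It is not LanguageTool's tokenizer (which is written in Java); the helper names `split_enclitic` and `tokenize` are hypothetical, and the pronoun alternation is copied from the token pattern quoted in this thread.

```python
import re

# Pronoun alternation copied from the token pattern in this thread.
CLITICS = r"(?:[ao]s?|l[ao]s?|lh[aeo]s?|n[ao]s?|me|se|te|vos)"

# Hypothetical helper pattern: one word, then a hyphen plus a single pronoun.
_SPLIT = re.compile(rf"^(\w+)(-{CLITICS})$", re.IGNORECASE)


def split_enclitic(word: str) -> list[str]:
    """Split 'Quero-lhes' into ['Quero', '-lhes']; leave other words intact."""
    m = _SPLIT.match(word)
    return [m.group(1), m.group(2)] if m else [word]


def tokenize(sentence: str) -> list[str]:
    """Very rough word/punctuation tokenizer, for illustration only."""
    tokens = []
    for chunk in re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence):
        tokens.extend(split_enclitic(chunk))
    return tokens


print(tokenize("Quero-lhes dizer."))
```

With this split, the hyphen travels with the pronoun, so a single token (and therefore a single POS tag) covers "-lhes". Ordinary hyphenated compounds such as "guarda-chuva" are left as one token because the second part is not in the pronoun list.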
@jaumeortola
Ahhhh...
Does that mean that I can use "-" in normal tokens?
Isn't it a wildcard?
<token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
<token regexp='yes' spacebefore='no'>[ao]s?|l[ao]s?|lh[aeo]s?|n[ao]s?|me|se|te|vos</token>
Would replacing it with this work?
<token regexp='yes' spacebefore='no'>-[ao]s?|-l[ao]s?|-lh[aeo]s?|-n[ao]s?|-me|-se|-te|-vos</token>
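On the wildcard question: in standard regular-expression syntax (Java's and Python's alike), a hyphen is only special inside a character class such as `[a-z]`; elsewhere it matches a literal "-". The sketch below checks the proposed pattern with Python's `re` module, assuming the pattern must match the whole token. The helper name `matches_whole_token` is hypothetical, and this does not exercise LanguageTool itself.

```python
import re

# Proposed single-token pattern, copied from the post above.
PROPOSED = r"-[ao]s?|-l[ao]s?|-lh[aeo]s?|-n[ao]s?|-me|-se|-te|-vos"


def matches_whole_token(token: str) -> bool:
    """True if the entire token matches one of the alternatives."""
    return re.fullmatch(PROPOSED, token) is not None


for candidate in ["-lhes", "-vos", "lhes", "xlhes"]:
    print(candidate, matches_whole_token(candidate))
```

Only the forms that start with a real hyphen match, so the leading "-" behaves as a literal character rather than a wildcard. Whether the pattern is useful in a rule still depends on the tokenizer actually producing tokens like "-lhes".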
Ahhhh… I have been reflecting…
What if the disambiguator could have:
<token regexp='yes'>-[ao]s?|-l[ao]s?|-lh[aeo]s?|-n[ao]s?|-me|-se|-te|-vos</token>
and output:
_HYPHEN:PP3CPD00
_HYPHEN:PP blah blah
etc. ?
Thank you!