[pt] Improving disambiguator verb related - 2022-07-10

Open marcoagpinto opened this issue 2 years ago • 3 comments

Hello @jaumeortola

I have come up with a very useful and powerful piece of POS information for the disambiguator.

Look at this:

		<token postag='VMIP3S0|VMM02S0' postag_regexp='yes'/>
		<token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
		<token regexp='yes' spacebefore='no'>[ao]s?|l[ao]s?|lh[aeo]s?|n[ao]s?|me|se|te|vos</token>

Could I ask you, when you have the time, to assign a single POS tag to:

		<token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
		<token regexp='yes' spacebefore='no'>[ao]s?|l[ao]s?|lh[aeo]s?|n[ao]s?|me|se|te|vos</token>

This would avoid having to create two rules for every verb pattern of the form “verb + - + lhe/te/etc.”.

I am still not sure about the name for the POS tag, but it would need to refer to the hyphen plus the pronoun attached to it. For example, "Quero-lhes dizer." would generate:

VMIP1S0
_HYPHEN:PP3CPD00
<plus other POSes in the sentence>

marcoagpinto avatar Jul 10 '22 00:07 marcoagpinto

I don't fully understand what you are asking for. One POS tag for two tokens is not possible.

What we do in other languages (French, Catalan) is to tokenize this way: <token>Quero</token><token>-lhes</token>. This could be done, but it requires changing the tokenizer and adjusting all rules for these patterns.

jaumeortola avatar Jul 10 '22 10:07 jaumeortola
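
For illustration, the tokenization described in the comment above (`Quero` + `-lhes`) can be sketched in a few lines of Python. This is only a sketch under assumptions, not LanguageTool's actual tokenizer: the names `split_clitic` and `tokenize` are hypothetical, only the plain ASCII hyphen is handled (the `&tracos_de_separacao;` entity may cover more characters), and the pronoun list is copied from the pattern quoted earlier in this thread.

```python
import re

# Pronoun alternatives copied from the token pattern quoted in this thread.
CLITIC = r"(?:[ao]s?|l[ao]s?|lh[aeo]s?|n[ao]s?|me|se|te|vos)"

# A word followed by exactly one hyphen-attached pronoun at the end.
# Assumption: only the ASCII hyphen is treated as a separator here.
VERB_CLITIC = re.compile(rf"^(\w+?)(-{CLITIC})$", re.IGNORECASE)


def split_clitic(word: str) -> list[str]:
    """Split 'Quero-lhes' into ['Quero', '-lhes']; leave other words whole."""
    match = VERB_CLITIC.match(word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]


def tokenize(sentence: str) -> list[str]:
    """Very rough tokenizer: words (with inner hyphens) and punctuation marks."""
    tokens: list[str] = []
    for chunk in re.findall(r"[\w-]+|[^\w\s]", sentence):
        tokens.extend(split_clitic(chunk))
    return tokens


print(tokenize("Quero-lhes dizer."))
```

The hyphen stays attached to the pronoun token, so a rule could then match the verb and the `-lhes` token as two consecutive tokens. Hyphenated words whose second part is not in the pronoun list, such as `guarda-chuva`, are left as a single token in this sketch.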

@jaumeortola

Ahhhh...

Does that mean that I can use "-" in normal tokens?

Isn't it a wildcard?

		<token regexp='yes' spacebefore='no'>&tracos_de_separacao;</token>
		<token regexp='yes' spacebefore='no'>[ao]s?|l[ao]s?|lh[aeo]s?|n[ao]s?|me|se|te|vos</token>

Would replacing it with this work?

<token regexp='yes' spacebefore='no'>-[ao]s?|-l[ao]s?|-lh[aeo]s?|-n[ao]s?|-me|-se|-te|-vos</token>

marcoagpinto avatar Jul 10 '22 10:07 marcoagpinto
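
On the wildcard question: in regular-expression syntax a hyphen only has a special meaning inside a character class such as `[a-z]`; anywhere else it matches a literal hyphen. The following Python sketch checks that for the pattern proposed above. Two assumptions apply: Python's `re` module stands in for the Java regex engine that LanguageTool runs on, and the helper `matches_token` (a hypothetical name) assumes that a token pattern has to match the whole token. Whether the pattern matches anything in practice still depends on the tokenizer producing `-lhes` as a single token.

```python
import re

# Original pronoun pattern and the hyphen-prefixed variant from this thread.
CLITICS = r"[ao]s?|l[ao]s?|lh[aeo]s?|n[ao]s?|me|se|te|vos"
MERGED = r"-[ao]s?|-l[ao]s?|-lh[aeo]s?|-n[ao]s?|-me|-se|-te|-vos"

# Equivalent, shorter form: one literal hyphen followed by the original group.
FACTORED = rf"-(?:{CLITICS})"


def matches_token(pattern: str, token: str) -> bool:
    """Return True if the pattern matches the entire token."""
    return re.fullmatch(pattern, token) is not None


for token in ["-lhes", "-me", "lhes", "xlhes", "-dizer"]:
    print(token, matches_token(MERGED, token), matches_token(FACTORED, token))
```

Only `-lhes` and `-me` match: `xlhes` is rejected, so the hyphen is not acting as a wildcard, and both spellings of the pattern agree on every sample token.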

Ahhhh… I have been reflecting…

What if the disambiguator could have: <token regexp='yes'>-[ao]s?|-l[ao]s?|-lh[aeo]s?|-n[ao]s?|-me|-se|-te|-vos</token>

and output:

_HYPHEN:PP3CPD00
_HYPHEN:PP blah blah

etc. ?

Thank you!

marcoagpinto avatar Jul 10 '22 10:07 marcoagpinto
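
The output proposed in the last comment can be sketched as a simple tag transformation. This is a hypothetical illustration, not existing LanguageTool behaviour: the `_HYPHEN:` prefix is only the name suggested in this thread, `hyphen_tags` and `SAMPLE_TAGS` are made-up names, and the only tag taken from the thread is `PP3CPD00` for `lhes`; a real implementation would look the tags up in the Portuguese tagger dictionary.

```python
HYPHEN_PREFIX = "_HYPHEN:"  # prefix name proposed in this thread, not an existing tag

# Hypothetical mini-lexicon; only this one entry appears in the thread.
SAMPLE_TAGS = {"lhes": ["PP3CPD00"]}


def hyphen_tags(token: str, lexicon: dict[str, list[str]]) -> list[str]:
    """Prefix the tags of the bare pronoun when the token starts with a hyphen."""
    if not token.startswith("-"):
        return []
    bare = token[1:].lower()
    return [HYPHEN_PREFIX + tag for tag in lexicon.get(bare, [])]


print(hyphen_tags("-lhes", SAMPLE_TAGS))
```

Tokens without a leading hyphen, or pronouns missing from the lexicon, get no extra tags in this sketch.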