cg-proc -w mangles case

Open TinoDidriksen opened this issue 7 years ago • 3 comments

See https://groups.google.com/forum/#!topic/constraint-grammar/pLpCsu-eUY4

I think this goes to @unhammer

TinoDidriksen avatar Dec 21 '18 10:12 TinoDidriksen

I think this one will be hard to deal with correctly without abandoning the "lemma-casing→form-casing" mechanism (where casing is {lower,title,upper}).

That is, in Apertium we currently completely drop source forms while encoding the casing of the source form on the source lemma, so Je/prpers<prn> becomes Prpers<prn>. But casing is ambiguous with single-letter forms (is U title or upper?), so U/prpers becomes PRPERS when it might as well have been Prpers. Translating U into you then might turn it into YOU instead of You.
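To make the ambiguity concrete, here is a minimal Python sketch of a three-way casing scheme like the one described above. It is not Apertium's or cg-proc's actual code; the function names `casing` and `apply_casing` are made up for illustration, and the order of the checks (all-caps tested before title) is an assumption chosen to reproduce the behaviour reported in this issue.

```python
def casing(form: str) -> str:
    """Classify a surface form as 'lower', 'title' or 'upper' (hypothetical helper)."""
    letters = [c for c in form if c.isalpha()]
    if not letters or not letters[0].isupper():
        return "lower"
    # A single capital letter satisfies this test, so "U" is classified as
    # 'upper' even though it could just as well be 'title'.
    if all(c.isupper() for c in letters):
        return "upper"
    return "title"


def apply_casing(kind: str, lemma: str) -> str:
    """Transfer a casing class onto a lemma (hypothetical helper)."""
    if kind == "upper":
        return lemma.upper()
    if kind == "title":
        return lemma[:1].upper() + lemma[1:].lower()
    return lemma


print(apply_casing(casing("Je"), "prpers"))  # Prpers
print(apply_casing(casing("U"), "prpers"))   # PRPERS
```

Once the source form is dropped, nothing downstream can tell whether `PRPERS` came from an all-caps context or from a sentence-initial single letter.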

Some ideas:

  1. Keep source form throughout the pipeline, letting e.g. transfer rules decide what to do. This would be nice for other reasons as well, but would require changes to many Apertium modules, and you'd still have to deal with casing-ambiguity at some point.
  2. ADD (@casinghint) in CG, where you have access to the form casing and the lemma casing (by regexp matching) as well as the part of speech, and then transfer upper-/lowercase based on the CG tag you added. I think this would be the easiest solution for now, and it lets you do things like disambiguate U being upper vs title based on whether the following word is all-caps.
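The decision behind the second idea can be sketched in Python. In practice this would be a CG rule adding a tag, not Python; the function name `casing_hint` and the exact heuristic (a lone capital counts as upper only when the next word has more than one letter and is all-caps) are assumptions for illustration only.

```python
from typing import Optional


def casing_hint(form: str, next_form: Optional[str]) -> str:
    """Return 'lower', 'title' or 'upper' for form (hypothetical heuristic).

    Single-letter capitals are resolved by looking at the following word.
    """
    letters = [c for c in form if c.isalpha()]
    if not letters or not letters[0].isupper():
        return "lower"
    if len(letters) == 1:
        # Ambiguous case: treat as 'upper' only if the next word is a
        # multi-letter all-caps word; otherwise assume 'title'.
        next_letters = [c for c in (next_form or "") if c.isalpha()]
        next_is_allcaps = len(next_letters) > 1 and all(
            c.isupper() for c in next_letters
        )
        return "upper" if next_is_allcaps else "title"
    return "upper" if all(c.isupper() for c in letters) else "title"


print(casing_hint("U", "bent"))  # title
print(casing_hint("U", "BENT"))  # upper
```

A transfer rule could then read the hint and choose between You and YOU instead of relying on the lemma casing alone.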

unhammer avatar Dec 21 '18 20:12 unhammer

@MarcRiera ^

unhammer avatar Dec 21 '18 20:12 unhammer

Thanks for the ideas! After experimenting a bit with CG, I have managed to overcome the limitation for both languages, with two solutions that are pair-independent (everything is corrected in the language's CG after-section):

  • For Romanian "A" and "O", a new reading with the correct case is added if the word appears at BOS, and the original incorrect reading is removed. For non-BOS occurrences, the uppercase reading is kept.
  • For English "I", the previous solution does not seem to work for non-BOS cases, because even if a lowercase reading can be inserted, the -w flag changes it to uppercase again. My first workaround was to change the surface form to "prpers" by inserting a new cohort and removing the old one, but this could negatively affect the tagger (which also has access to surface forms). So the solution for BOS occurrences has been, as @unhammer mentioned, to add an extra tag that is recognised during transfer in the English-Catalan pair.

marcriera avatar Dec 23 '18 18:12 marcriera