cg3
Allow printing tags without full tracing
Since commit a9c767573edaefb0d59c2c9d93fbf1048d5c92a3, tags are not printed unless tracing is enabled. However, since CG is now used in many Apertium pairs during generation to handle preferences, tags may be necessary without full tracing.
Tags are especially useful when running a testvoc. If no tags are printed, only the internal lemma with # is shown, which makes debugging difficult. Enabling tracing with -t helps, but it also adds extra information that the postgenerator does not handle properly.
I suggest adding a new flag to print tags, regardless of tracing, or printing tags by default again (unless it is really preferable not to print tags by default).
Thanks!
Ping @unhammer
So you're running a pipeline that uses cg-proc -n -g after lt-proc -g, and the lt-proc steps which used to give you #foo<tag> now give #foo?
If I understand correctly, the problem is that
$ echo '^NotInGen<np>$' |lt-proc -b nob-nno.autogen.bin
^NotInGen<np>/@NotInGen<np>$
$ echo '^NotInGen<np>$' |lt-proc -b nob-nno.autogen.bin | cg-proc -g -n nob-nno.genprefs.rlx.bin
#NotInGen
doesn't give the tags, whereas adding -t does give them
$ echo '^NotInGen<np>$' |lt-proc -b nob-nno.autogen.bin | cg-proc -g -n -t nob-nno.genprefs.rlx.bin
#NotInGen\<np\>
but the output is noisy if a rule actually hit:
$ echo å gafle|apertium -f none -d . nob-nno-dgen |cg-proc -g -n -t nob-nno.genprefs.rlx.bin
å gafla/¬gafle\<v:infa_infe\><REMOVE:26>
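As a stopgap until cg-proc offers a way to print tags without tracing, the -t output could be post-filtered. The sketch below assumes that trace annotations are the unescaped angle-bracket tags that start with an upper-case rule type and a colon, as in `<REMOVE:26>` above, and that tags belonging to the reading are backslash-escaped. The name `strip_trace` is hypothetical; nothing here is part of cg3.

```python
import re

# Assumption: a trace annotation is an unescaped <...> tag such as <REMOVE:26>,
# i.e. an upper-case rule type, a colon, then the rule line or name.
# Escaped tags such as \<np\> belong to the reading and are left alone.
TRACE_TAG = re.compile(r"(?<!\\)<[A-Z][A-Z_-]*:[^<>]*>")

def strip_trace(line: str) -> str:
    """Remove trace annotations from one line of `cg-proc -g -n -t` output."""
    return TRACE_TAG.sub("", line)

if __name__ == "__main__":
    print(strip_trace(r"#NotInGen\<np\>"))
    print(strip_trace(r"å gafla/¬gafle\<v:infa_infe\><REMOVE:26>"))
```

This only removes the rule annotation. The removed reading (`¬gafle...`) is still present, so it is not a replacement for a proper flag.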
or printing tags by default again (unless it is really preferable not to print tags by default)
This is running after the generator, so we do have to get rid of the tags to avoid them ending up in the output shown to the user.
Also, I suppose you only want tags on the stuff we couldn't generate?
Your reproduction with NotInGen above is exactly the problem, thanks.
Yes, tags should only appear for lexical units that cannot be generated. The generator is running right before this and trying to generate a surface form, so the input to cg-proc -g -n will only contain tags in cohorts if there's a generation error in the previous step of the pipeline.
Well, there will also be tags on readings if there are variant tags (in addition to the input tags which are there since we use lt-proc -b on the generator):
$ echo blå | apertium -d . nob-nno-dgen
^blå<adj><sint><pst><un><pl><ind>/blå/blåe<v:blå_blåe>$
(that's the input to cg-proc -g -n)
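To make the shape of that input concrete, here is a minimal sketch that splits one such cohort into the generator's input and its readings, with the tags of each reading listed separately. It assumes the `^input/reading/reading$` layout shown above and that literal slashes are backslash-escaped; `split_cohort` is a hypothetical helper, not an Apertium or cg3 function.

```python
import re

def split_cohort(cohort: str):
    """Split '^input/reading1/reading2$' into (input, [(form, [tags]), ...])."""
    body = cohort.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a cohort: " + cohort)
    # Split on slashes that are not backslash-escaped (assumption: an escaped
    # backslash directly before a slash does not occur).
    parts = re.split(r"(?<!\\)/", body[1:-1])
    source, readings = parts[0], parts[1:]
    parsed = []
    for reading in readings:
        form = re.sub(r"<[^<>]*>", "", reading)     # surface form without tags
        tags = re.findall(r"<([^<>]*)>", reading)   # tag names without brackets
        parsed.append((form, tags))
    return source, parsed

if __name__ == "__main__":
    src, readings = split_cohort("^blå<adj><sint><pst><un><pl><ind>/blå/blåe<v:blå_blåe>$")
    print(src)
    for form, tags in readings:
        print(form, tags)
```

For the blå example this yields one reading without tags (`blå`) and one carrying the variant tag `v:blå_blåe`, which is why successfully generated forms can still arrive at cg-proc with tags attached.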
You're right, of course. For some reason I had assumed these were just removed, but they are tags after all.
I suppose we could distinguish between invalid and valid readings by checking whether the reading starts with a # or @ in the input. The generator adds these only if it cannot generate anything. I assume they are also escaped when generation succeeds (there could be a lexical unit beginning with either of these characters).
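The check proposed here could look roughly like the following sketch. It relies on the assumption stated above: an unescaped leading # or @ appears only on forms the generator failed to produce, and a literal # or @ at the start of a valid form arrives backslash-escaped. The function names are hypothetical and do not reflect how cg-proc is implemented.

```python
import re

# Unescaped <...> tags on a reading.
TAG = re.compile(r"(?<!\\)<[^<>]*>")

def is_ungenerated(reading: str) -> bool:
    # Assumption: the generator prefixes failures with an unescaped # or @;
    # a valid form starting with these characters would begin with a backslash.
    return reading.startswith(("#", "@"))

def render(reading: str) -> str:
    """Keep tags on ungenerated readings, strip them from generated ones."""
    if is_ungenerated(reading):
        return reading
    return TAG.sub("", reading)

if __name__ == "__main__":
    for r in ["#NotInGen<np>", "blåe<v:blå_blåe>", r"\#hashtag<n>"]:
        print(render(r))
```

With this rule, `#NotInGen<np>` keeps its tag for debugging, while `blåe<v:blå_blåe>` is reduced to `blåe` so that no tags leak into the output shown to the user.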