lttoolbox
`lt-proc -g -b` should output @ symbol when there are unconsumed tags
For regular bidix lt-proc -b, we want to just copy over unconsumed tags and that is fine:
$ echo '^kake<n><m><unconsumed>$' |lt-proc -b nob-nno.autobil.bin
^kake<n><m><unconsumed>/kake<n><f><unconsumed>$
When using regular generation lt-proc -g, unconsumed tags lead to #-marks:
$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc -g nob-nno.autogen.bin
#kake
$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc --debugged-gen nob-nno.autogen.bin
#kake\<n\>\<f\>\<sg\>\<ind\>
But when using lt-proc in bilingual mode on a generator, we get the unconsumed tag without any debug symbol:
$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc -g -b nob-nno.autogen.bin
^kake<n><f><sg><ind><unconsumed>/kake<unconsumed>$
(while completely unmatched words do get an @)
This can lead to hard-to-debug issues when we have a partial match: after the subsequent cg-proc step we just see the lemma as if it were the surface form, with no hint that the full analysis was not found in the generator.
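Until something like this is in lt-proc itself, a rough post-hoc check on the current `-g -b` output can at least flag suspicious units. The sketch below is only a heuristic under stated assumptions: `looks_partially_matched` is a hypothetical helper (not part of lttoolbox), it expects one `^source/target$` unit per call, it ignores escaped characters, and it will give a false positive if the generator legitimately outputs a tag identical to the last input tag. It is meant for generator output only, since plain bidix lookups copy tags over on purpose.

```python
import re

# Matches one <tag>; assumes no escaped angle brackets inside the unit.
TAG = re.compile(r"<([^<>]+)>")
# Matches the run of tags at the very end of a string.
TRAILING_TAGS = re.compile(r"(?:<[^<>]+>)+$")


def looks_partially_matched(unit):
    """Heuristic: True if the target side ends in tags that were
    copied over unchanged from the end of the source side."""
    body = unit.strip()
    if body.startswith("^"):
        body = body[1:]
    if body.endswith("$"):
        body = body[:-1]
    # Everything before the first "/" is the source side.
    source, _, target = body.partition("/")
    if not target:
        return False
    match = TRAILING_TAGS.search(target)
    if match is None:
        return False  # target ends in plain text, nothing was copied over
    trailing = TAG.findall(match.group(0))
    source_tags = TAG.findall(source)
    if len(trailing) > len(source_tags):
        return False
    return source_tags[-len(trailing):] == trailing


if __name__ == "__main__":
    for line in [
        "^kake<n><f><sg><ind><unconsumed>/kake<unconsumed>$",
        "^lykke<n><f><sg><ind>/lykke/lukke<v:lykke_lukke.vok-y2u>$",
    ]:
        print(looks_partially_matched(line), line)
```

This flags the kake example and leaves the lykke example alone, because there the trailing output tag does not come from the input.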
Ideally, when the -b switch is given after -g (or -d), we would get an @ when there are unconsumed input tags. Note: we don't want an @ if there are output tags, e.g.
$ echo '^lykke<n><f><sg><ind>$' |lt-proc -g -b nob-nno.autogen.bin
^lykke<n><f><sg><ind>/lykke/lukke<v:lykke_lukke.vok-y2u>$
is still correct (here the whole input is consumed, there are no leftovers, but there is still a tag in output). But we want
$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc -g -b nob-nno.autogen.bin
^kake<n><f><sg><ind><unconsumed>/@kake$
and perhaps
$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc -d -b nob-nno.autogen.bin
^kake<n><f><sg><ind><unconsumed>/@kake\<unconsumed\>$
(though the details of -g vs -d are less important than just having the @ in there)
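To make the requested rule concrete, here is a minimal sketch of the decision in Python. It does not reflect how lt-proc is implemented; `bilingual_generation_output` and its parameters are made up for illustration, and it assumes the caller already knows which input tags the transducer failed to consume. The point is only that the @ depends on leftover input tags, not on whether the output contains tags.

```python
def bilingual_generation_output(source, generated, unconsumed_tags, debug=False):
    """Format one unit the way this issue proposes for -g -b / -d -b.

    source          -- input lexical unit without the ^ and $ delimiters
    generated       -- generator output for the consumed part
                       (may itself contain tags, which is fine)
    unconsumed_tags -- input tags that were left over (hypothetical input;
                       a real implementation would track this internally)
    debug           -- True mimics the proposed -d variant, which also
                       shows the leftover tags in escaped form
    """
    if not unconsumed_tags:
        # Whole input consumed: no mark, even if the output has tags.
        return f"^{source}/{generated}$"
    leftovers = ""
    if debug:
        leftovers = "".join(f"\\<{tag}\\>" for tag in unconsumed_tags)
    return f"^{source}/@{generated}{leftovers}$"


if __name__ == "__main__":
    src = "kake<n><f><sg><ind><unconsumed>"
    print(bilingual_generation_output(src, "kake", ["unconsumed"]))
    print(bilingual_generation_output(src, "kake", ["unconsumed"], debug=True))
    print(bilingual_generation_output(
        "lykke<n><f><sg><ind>", "lykke/lukke<v:lykke_lukke.vok-y2u>", []))
```

The three calls reproduce the three desired outputs listed above: @kake for -g, @kake plus the escaped leftover tag for -d, and the unmarked lykke line when everything was consumed.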