lttoolbox
Option to output '~' as compound separator
apertium-pretransfer has an option -e to treat ~ as a compound separator. I don't know if any other tools have this, but it would be nice if we could implement support for that throughout the pipeline, so that we can keep + in the sense of <j/> and ~ in the sense of compounds separate.
Motivation:
Currently, transfer has no way of knowing whether there was an actual space in the input or whether it was just placed there by pretransfer, which saw a + and output a space.
If in transfer you want to match a compound followed by something else, you currently have to output the first two parts with no <b/> between them and then a <b/>, but that first blank that you output will be the space that was added by pretransfer, with rules like
<out>
<lu><clip pos="1" side="tl" part="whole"/></lu> <!-- no b/ here -->
<lu><clip pos="2" side="tl" part="whole"/></lu><b/> <!-- this will output the blank that was between 1 and 2! -->
<lu><clip pos="3" side="tl" part="whole"/></lu>
</out>
This means that on the one hand we get
luftputebåten<em>min</em>
→ lt-proc → luftpute+båten[<em>]min[</em>]
→ pretransfer → luftpute båten[<em>]min[</em>]
→ transfer → luftputebåten min[<em>][</em>]
but also that it's not really possible to make a general rule that matches, say, both dynamic compounds and number-compounds that should be treated the same way but had no space added by pretransfer (that first <b/> will then be empty, turning 2.-kvartalet deres into 2.kvartaletdeira).
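To make the information loss concrete, here is a toy Python sketch. It is not how apertium-pretransfer is implemented and it works on the simplified strings from the example above rather than the real stream format; the function names and the idea of a surviving marker are assumptions made purely for illustration.

```python
# Toy model only: NOT the real apertium-pretransfer, and it operates on the
# simplified strings used in this issue, not on the actual stream format.

def toy_pretransfer(stream: str) -> str:
    """Split '+'-joined parts by writing a plain space, as described above."""
    return stream.replace("+", " ")


def toy_pretransfer_marked(stream: str, marker: str = "~") -> str:
    """Hypothetical variant: write a distinct marker where the join was."""
    return stream.replace("+", marker)


compound = "luftpute+båten"    # one compound, no space in the input
two_words = "luftpute båten"   # two words with a real space in the input

# With a plain space, both inputs look identical to the next stage.
print(toy_pretransfer(compound) == toy_pretransfer(two_words))

# With a distinct marker, the next stage could still tell them apart.
print(toy_pretransfer_marked(compound) == toy_pretransfer_marked(two_words))
```

The first comparison is true and the second is false, which is the whole point: once the join has been written out as an ordinary space, nothing downstream can recover whether that space was in the input.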