Option to output '~' as compound separator

Open · unhammer opened this issue 8 months ago · 1 comment

apertium-pretransfer has an option, -e, to treat ~ as a compound separator. I don't know if any other tools have this, but it would be nice to support it throughout the pipeline, so that we can keep + in the sense of <j/> and ~ in the sense of compounds separate.

Motivation:

Currently, transfer has no way of knowing whether there was an actual space in the input or whether it was just placed there by pretransfer, which saw a + and output a space.
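To make the ambiguity concrete, here is a toy model in Python. It is only a sketch, not the real apertium-pretransfer: `toy_pretransfer` is a hypothetical helper, the `<n>` tags are made up for illustration, and escaping, blanks in brackets and the -e option are all ignored. It only shows that once + has been turned into a space, a split compound looks exactly like two words that were already separate.

```python
import re

# Matches one lexical unit of the form ^...$ (toy model: no escaping handled).
LU = re.compile(r"\^(.*?)\$")


def toy_pretransfer(stream: str) -> str:
    """Hypothetical stand-in for pretransfer: split ^a+b$ into ^a$ ^b$."""

    def split_unit(match: re.Match) -> str:
        parts = match.group(1).split("+")
        # The parts are joined with a plain space, just like a real input space.
        return " ".join("^" + part + "$" for part in parts)

    return LU.sub(split_unit, stream)


compound = "^luftpute<n>+båt<n>$"     # one unit joined with + (tags are made up)
two_words = "^luftpute<n>$ ^båt<n>$"  # two units with a real space between them

print(toy_pretransfer(compound))
print(toy_pretransfer(compound) == toy_pretransfer(two_words))
```

Both inputs come out identical in this model, so nothing downstream can tell which space was real. A separate compound separator such as ~ would be one way to keep that information.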

If you want a transfer rule to match a compound followed by something else, you currently have to output the first two parts with no <b/> between them, followed by a <b/>. But the first blank that you output will be the space that was added by pretransfer, with rules like

<out>
<lu><clip pos="1" side="tl" part="whole"/></lu>              <!-- no b/ here -->
<lu><clip pos="2" side="tl" part="whole"/></lu><b/>     <!-- this will output the blank that was between 1 and 2! -->
<lu><clip pos="3" side="tl" part="whole"/></lu>
</out>

This has two consequences. On the one hand, we get luftputebåten<em>min</em> → lt-proc → luftpute+båten[<em>]min[</em>] → pretransfer → luftpute båten[<em>]min[</em>] → transfer → luftputebåten min[<em>][</em>]. On the other hand, it's not really possible to write a general rule that matches, say, both dynamic compounds and number compounds that should be treated the same way but had no space added by pretransfer: for those, that first <b/> will be empty, turning 2.-kvartalet deres into 2.kvartaletdeira.

unhammer, Oct 19 '23 09:10