Punctuation being disconnected from word

Open jamesdeluk opened this issue 1 year ago • 5 comments

In the edit text box the punctuation (in this case, ") sticks to the word: image image

However, in read mode it gets left behind: image

This also happens with other punctuation: image

I checked the language settings but didn't see a config option there that would fix this (I've already added 가-힣 to the word characters box, which covers all Korean characters).

jamesdeluk avatar Feb 20 '24 22:02 jamesdeluk

Thanks for taking the time to submit the issue.

This one is tough to solve, and I'm not sure how to do it. It has to do with how Lute renders items on the page. Each element is put in its own HTML span element: words, punctuation, spaces, etc. So something like "dog." is rendered as <span>dog</span><span>.</span> ... and I'm not sure if it's possible to force spans to stay on the same line.

There may be a different way to render things, but it might be tough!

jzohrab avatar Feb 21 '24 05:02 jzohrab

I see. It seems it might be even more complicated given how textsentences are done:

<span id="ID-277-1" class="textitem click word word2678 showtooltip status1" data-lang-id="14" data-paragraph-id="4" data-sentence-id="15" data-text="중얼거렸다" data-status-class="status1" data-order="277" data-wid="2678" style="font-size: 1rem; margin-bottom: 0px;">중얼거렸다</span>
<span id="ID-278-1" class="textitem" data-lang-id="14" data-paragraph-id="4" data-sentence-id="15" data-text=". “" data-status-class="status0" data-order="278" style="font-size: 1rem; margin-bottom: 0px;">.&nbsp;“</span>
</span>
<span class="textsentence" id="sent_16">
<span id="ID-279-1" class="textitem click word word2680 showtooltip status1" data-lang-id="14" data-paragraph-id="4" data-sentence-id="16" data-text="진작" data-status-class="status1" data-order="279" data-wid="2680" style="font-size: 1rem; margin-bottom: 0px;">진작</span>

The second span contains . “ (.&nbsp;“) as a single unit, at the end of one textsentence. It would need to be split at the &nbsp; into . and “, with the . appended to the previous span (without affecting word, data-text, etc.), the . “ textitem reduced to just &nbsp;, and the “ prepended to the first span in the next textsentence, i.e. something like:

<span id="ID-277-1" class="textitem click word word2678 showtooltip status1" data-lang-id="14" data-paragraph-id="4" data-sentence-id="15" data-text="중얼거렸다" data-status-class="status1" data-order="277" data-wid="2678" style="font-size: 1rem; margin-bottom: 0px;">중얼거렸다.</span>
<span id="ID-278-1" class="textitem" data-lang-id="14" data-paragraph-id="4" data-sentence-id="15" data-text=" " data-status-class="status0" data-order="278" style="font-size: 1rem; margin-bottom: 0px;">&nbsp;</span>
</span>
<span class="textsentence" id="sent_16">
<span id="ID-279-1" class="textitem click word word2680 showtooltip status1" data-lang-id="14" data-paragraph-id="4" data-sentence-id="16" data-text="진작" data-status-class="status1" data-order="279" data-wid="2680" style="font-size: 1rem; margin-bottom: 0px;">“진작</span>
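
The splitting step on its own could look roughly like this in Python. This is only a sketch: `split_punctuation` is a name I made up, not something that exists in Lute, and it assumes the non-word token arrives as a plain string with at most one run of whitespace in it.

```python
import re


def split_punctuation(token: str) -> tuple[str, str, str]:
    """Split a non-word token into (trailing, whitespace, leading).

    trailing:   punctuation before the first whitespace (goes with the previous word)
    whitespace: the first whitespace run, including a non-breaking space
    leading:    everything after that whitespace (goes with the next word)

    Hypothetical helper for illustration only; a token with no
    whitespace at all is treated entirely as trailing punctuation.
    """
    match = re.match(r"^(\S*)(\s*)(.*)$", token, flags=re.DOTALL)
    return match.group(1), match.group(2), match.group(3)


trailing, space, leading = split_punctuation(". “")
print(repr(trailing), repr(space), repr(leading))
```

The hard part is everything around it: attaching the pieces to the neighbouring spans across a textsentence boundary while leaving data-text, data-order, and so on untouched.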

So yes, perhaps not the simplest thing!

jamesdeluk avatar Feb 21 '24 07:02 jamesdeluk

Yeah, there may be a better way (likely is a better way) to parse and render all of this stuff, but it's tough. Thanks for the issue though, maybe someone will feel like looking into it. Unlikely, but you never know.

jzohrab avatar Feb 22 '24 05:02 jzohrab

This comes up frequently in French, where d' (the elided form of "de") gets split from the word that follows it, for instance:

La France continuera aussi d'œuvrer en vue d' un cessez-le-feu immédiat et durable

yue-dongchen avatar May 16 '24 05:05 yue-dongchen

One way to do it might be to add an extra wrapping <span> around each word that is followed by a known punctuation symbol, and then apply a white-space: nowrap style to that wrapping <span>.

Example paragraph (before)

image

Example paragraph (after)

image

I've added the styles manually here but you can see the commas are now stuck to "empfindlich" and "heiß". This would probably require a separate language configuration field for each language to define any "sticky" punctuation symbols so the extra spans could be added during parsing.

HTML example

<span style="white-space: nowrap">
  <span id="ID-127-1" class="textitem click word word2532 showtooltip status1" data-lang-id="1" data-paragraph-id="5" data-sentence-id="11" data-text="empfindlich" data-status-class="status1" data-order="127" data-wid="2532" style="font-size: 1.125rem; margin-bottom: 5.4px;">empfindlich</span>
  <span id="ID-128-1" class="textitem" data-lang-id="1" data-paragraph-id="5" data-sentence-id="11" data-text=", " data-status-class="status0" data-order="128" style="font-size: 1.125rem; margin-bottom: 5.4px;">,&nbsp;</span>
</span>

That's how it would look in the HTML, although ideally the wrapping span would be styled with a CSS class rather than an inline style.
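
To make the idea concrete, here's a minimal Python sketch of the wrapping step. Everything in it is an assumption for illustration: the `(text, is_word)` token shape, the `STICKY` set (standing in for the per-language setting), the `render` function, and the `nowrap` class name are all made up, and the real spans carry many more attributes than this.

```python
from html import escape

# Hypothetical per-language "sticky" punctuation setting.
STICKY = {",", ".", "!", "?", ";", ":", "”"}


def render(tokens: list[tuple[str, bool]]) -> str:
    """Render (text, is_word) tokens as spans, wrapping each word together
    with a following sticky-punctuation token in a nowrap span."""
    out = []
    i = 0
    while i < len(tokens):
        text, is_word = tokens[i]
        span = f"<span>{escape(text)}</span>"
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Only wrap when a word is directly followed by a non-word token
        # whose first character is in the sticky set.
        if is_word and nxt is not None and not nxt[1] and nxt[0][:1] in STICKY:
            punct = f"<span>{escape(nxt[0])}</span>"
            out.append(f'<span class="nowrap">{span}{punct}</span>')
            i += 2
        else:
            out.append(span)
            i += 1
    return "".join(out)


print(render([("empfindlich", True), (", ", False), ("heiß", True)]))
```

The stylesheet would then need a single rule along the lines of `.nowrap { white-space: nowrap; }`. Note this only handles punctuation that follows a word; punctuation that should stick to the next word (an opening quote, or the French d' case above) would need the mirror-image check.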

cblanken avatar May 16 '24 13:05 cblanken