
When adjacent lexical words are both marked up

Open DavidHaslam opened this issue 7 months ago • 4 comments

Some languages have writing systems in which, to a large extent, there are no spaces between words.

Examples include:

  1. Thai
  2. Lao
  3. Khmer
  4. Burmese (Myanmar)

Until the Middle Ages, most languages in Europe were written in a similar way: scriptio continua was standard practice before the origins of silent reading and Gutenberg's invention of the printing press.

When adding bold markup algorithmically to lexical names in Thai text, I came across an example in which two adjacent Thai words were names, and as there was no space between them, this is the result:

Hosea 1:7: แต่เราจะมีความเมตตาต่อวงศ์วานของ**ยูดาห์** และจะช่วยพวกเขาให้รอดพ้นโดย**พระเยโฮวาห์****พระเจ้า**ของพวกเขา และจะไม่ช่วยพวกเขาให้รอดพ้นโดยคันธนู หรือโดยดาบ หรือโดยการสู้รบ โดยม้าทั้งหลาย หรือโดยเหล่าทหารม้า”

which displays as follows:

Hosea 1:7: แต่เราจะมีความเมตตาต่อวงศ์วานของยูดาห์ และจะช่วยพวกเขาให้รอดพ้นโดยพระเยโฮวาห์****พระเจ้าของพวกเขา และจะไม่ช่วยพวกเขาให้รอดพ้นโดยคันธนู หรือโดยดาบ หรือโดยการสู้รบ โดยม้าทั้งหลาย หรือโดยเหล่าทหารม้า”

This could be due to a software bug in MarkdownViewer++ or maybe it's a weakness in the specification for Markdown itself. The fact that GitHub does the same suggests that it's the latter.

Either way, the result is not what is required when the markup is applied individually for two or more adjacent words with no intervening space.

A workaround in the meantime is to place a zero-width space (ZWSP, U+200B) between the words before adding the Markdown asterisks.
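As a rough illustration of that workaround, here is a minimal Python sketch. The function name `bold_names` and the idea of passing the names as a plain list are assumptions made for this example; they are not part of MarkdownViewer++ or of the script used to produce the text above. The sketch inserts the ZWSP after marking rather than before, which should give the same output.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def bold_names(text: str, names: list[str]) -> str:
    """Wrap each name in ** and keep adjacent bold runs apart with a ZWSP.

    Hypothetical helper for illustration only.
    """
    if not names:
        return text
    # Try longer names first so a short name never matches inside a longer one.
    pattern = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    marked = re.sub(pattern, lambda m: f"**{m.group(0)}**", text)
    # Two names with nothing between them produce "****"; split that run.
    # Assumption: the input text contains no "****" of its own.
    return marked.replace("****", f"**{ZWSP}**")


verse = "โดยพระเยโฮวาห์พระเจ้าของพวกเขา"
print(bold_names(verse, ["พระเยโฮวาห์", "พระเจ้า"]))
```

The ZWSP is invisible when rendered, so the two names still appear adjacent, but each is now delimited by its own pair of asterisks.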

For further details, please see this link to my ongoing conversation with Grok.

Background reading:

  1. https://en.wikipedia.org/wiki/Scriptio_continua
  2. https://amzn.eu/d/1cbeGkD

DavidHaslam avatar May 26 '25 10:05 DavidHaslam

The second further reading link is for this book:

Space Between Words: The Origins of Silent Reading (Figurae: Reading Medieval Culture) Paperback – Illustrated, 1 Jan. 2000 by Paul Saenger (Author)

DavidHaslam avatar May 26 '25 10:05 DavidHaslam

Markdown is just a filter for HTML. It is meant to prioritize writing text meant for display in a convenient way; it is not meant to provide the same level of control as composing HTML correctly. That said, it supports some amount of HTML passthrough. If you want to have distinct `<strong/>` tags for consecutive bold character sequences, you can just write `word1<strong>word2</strong><strong>word3</strong>word4` or `word1**word2**<strong>word3</strong>word4` instead. See https://spec.commonmark.org/dingus/?text=Word1%20word2%20word3%20word4%2e%0A%0AWord1word2%3Cstrong%3Eword3%3C%2Fstrong%3Eword4%2e for a demo.
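For text that is marked up algorithmically, the passthrough approach could look something like the following Python sketch. The helper name `strong_names` is made up for this example, and it assumes the names contain no characters that need HTML escaping.

```python
import re


def strong_names(text: str, names: list[str]) -> str:
    """Wrap each name in its own <strong> element instead of ** delimiters.

    Hypothetical helper for illustration only.
    """
    if not names:
        return text
    # Try longer names first so a short name never matches inside a longer one.
    pattern = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    return re.sub(pattern, lambda m: f"<strong>{m.group(0)}</strong>", text)


print(strong_names("word1word2word3word4", ["word2", "word3"]))
```

Because explicit tags mark where each element starts and ends, adjacent names do not run together the way `**` delimiters do.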

binki avatar May 26 '25 13:05 binki

Q. Is there anything defined in CommonMark that covers this?

DavidHaslam avatar May 26 '25 18:05 DavidHaslam

If you look at the dingus I linked, it is the reference CommonMark implementation. See how it handles this here: https://spec.commonmark.org/dingus/?text=Word1%20word2****word3%20word4 .

For the spec itself, it has a long section and many examples of how various situations are handled at https://spec.commonmark.org/0.31.2/#emphasis-and-strong-emphasis .

Back to what I said originally and what many people in CommonMark forums might say: Markdown/CommonMark is just a filter outputting HTML and meant to transform what people naturally type in documents into HTML. It is not meant to provide the level of control to say that a `<strong/>` should be divided between letters which are adjacent. In HTML, both `<strong>asdf</strong>` and `<strong>a</strong><strong>sdf</strong>` render the same (unless you provide some additional styling). So tagging multiple letters in the same word as distinct through Markdown itself is not really a use case of standard Markdown/CommonMark. Also, Markdown/CommonMark already provides HTML passthrough. So it is perfectly valid to simply write `<strong>a</strong><strong>sdf</strong>` as Markdown if that is your desired output.
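If the rendered appearance is all that matters, one further option (not something proposed earlier in this thread, just a sketch that follows from the point about identical rendering) is to mark a run of adjacent names as a single bold span. The helper name `bold_merged` is hypothetical.

```python
import re


def bold_merged(text: str, names: list[str]) -> str:
    """Wrap each unbroken run of names in one pair of ** delimiters.

    Hypothetical helper for illustration only.
    """
    if not names:
        return text
    # Try longer names first so a short name never matches inside a longer one.
    alt = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    # One or more names in a row, with nothing between them, form a single run.
    return re.sub(f"(?:{alt})+", lambda m: f"**{m.group(0)}**", text)


print(bold_merged("โดยพระเยโฮวาห์พระเจ้าของพวกเขา", ["พระเยโฮวาห์", "พระเจ้า"]))
```

This loses the boundary between the two names in the output, so it only suits cases where the names do not need to be distinguishable in the resulting HTML.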

binki avatar May 26 '25 18:05 binki