cmark Emphasis and East Asian text

Discussions:

https://github.com/github/markup/issues/1076
https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491

There are three commits:

Change on code,
Proposed change on spec,
Additional test cases (maybe insufficient).

I realise that the change introduces some ambiguities, but I don't think they are actually a problem.

Rule 6:

__foo、__bar__、baz__
.
<p><strong>foo、</strong>bar<strong>、baz</strong></p>

is not

<p><strong>foo、<strong>bar</strong>、baz</strong></p>

Rule 7:

**〔**foo〕
.
<p><strong>〔</strong>foo〕</p>

is not

<p>**〔**foo〕</p>

Jun 25 '17 03:06 ikedas

Thanks for doing this!

I think we could simplify this considerably by defining "punctuation character" (for purposes of the spec) so that it simply excludes East Asian punctuation characters.

This would really simplify the clauses in the spec for emphasis, since we'd avoid complicated logical constructions like (punctuation and not east asian).

It would also make the code slightly more efficient (one test rather than two -- though perhaps the compiler is smart enough to optimize away this difference).

What do you think?

@kivikakk

Jun 27 '17 08:06 jgm

Thanks for the comment.

I think we could simplify this considerably by defining "punctuation character" (for purposes of the spec) so that it simply excludes East Asian punctuation characters.

This would really simplify the clauses in the spec for emphasis, since we'd avoid complicated logical constructions like (punctuation and not east asian).

I thought the same at first, but such a modification could not handle many cases that use underscores (_). In any case, EA writers really do need EA punctuation to be handled differently from Western punctuation.

Another point is that some punctuation characters are shared between EA and Western text, e.g. “ and ”. They cannot be excluded.

Jun 27 '17 14:06 ikedas

+++ IKEDA Soji [Jun 27 17 07:19 ]:

I think we could simplify this considerably by defining "punctuation
character" (for purposes of the spec) so that it simply excludes
East Asian punctuation characters.

This would really simplify the clauses in the spec for emphasis,
since we'd avoid complicated logical constructions like (punctuation
and not east asian).

I thought the same at first, but such a modification could not handle many cases that use underscores (_). In any case, EA writers really do need EA punctuation to be handled differently from Western punctuation.

Can you give a specific example of a case where you think what I suggest wouldn't work? I think I can do it in a way that is logically equivalent to yours, but simpler both in the spec and the program.

Another point is that some punctuation characters are shared between EA and Western text, e.g. “ and ”. They cannot be excluded.

Yes, the idea would be to define 'punctuation character' to include these but exclude east-asian-only punctuation.

Jun 27 '17 21:06 jgm

Can you give a specific example of a case where you think what I suggest wouldn't work? I think I can do it in a way that is logically equivalent to yours, but simpler both in the spec and the program.

OK, here they are. In the following texts, 「, 」 and 。 are EA punctuation characters.

Example 1

猫は*「のどか」*という。

猫は_「のどか」_という。

Current master:

<p>猫は*「のどか」*という。</p>
<p>猫は_「のどか」_という。</p>

Excluding EA punctuations:

<p>猫は<em>「のどか」</em>という。</p>
<p>猫は_「のどか」_という。</p>

Expected (with this PR):

<p>猫は<em>「のどか」</em>という。</p>
<p>猫は<em>「のどか」</em>という。</p>

Example 2

猫は*「のどか」*という。犬は*名がない*。

猫は_「のどか」_という。犬は_名がない_。

Current master:

<p>猫は*「のどか」<em>という。犬は</em>名がない*。</p>
<p>猫は_「のどか」<em>という。犬は_名がない</em>。</p>

Excluding EA punctuations:

<p>猫は<em>「のどか」</em>という。犬は<em>名がない</em>。</p>
<p>猫は_「のどか」_という。犬は_名がない_。</p>

Expected (with this PR):

<p>猫は<em>「のどか」</em>という。犬は<em>名がない</em>。</p>
<p>猫は<em>「のどか」</em>という。犬は_名がない_。</p>

Another point is that some punctuation characters are shared between EA and Western text, e.g. “ and ”. They cannot be excluded.

Yes, the idea would be to define 'punctuation character' to include these but exclude east-asian-only punctuation.

Excluding these from Western punctuations will not affect Western text, because space before/after punctuation is ordinary in Western texts (␣ means space).

The␣cat␣is␣named␣*“Nodoka”*.

On the other hand, including them in EA punctuation will help with formatting EA text, because spaces before/after punctuation are unnatural in EA texts.

猫は*“のどか”*という。

猫は␣*“のどか”*␣という。            --- unnatural

So I think it would be better for them to belong to EA punctuation.

Jun 28 '17 03:06 ikedas

Just checking back in here; do we think we might be able to move forward with the suggestion in this PR?

Sep 05 '18 02:09 kivikakk

Just checking back in here; do we think we might be able to move forward with the suggestion in this PR?

Of course I agree. Please let me know if there is anything I should do. I'll re-push my commits.

Sep 05 '18 03:09 ikedas

Hi,

Any updates on this PR?

I think lots of projects are waiting for the update in upstream. :)

Thanks!

Sep 30 '18 04:09 tamlok

Excluding these from Western punctuations will not affect Western text, because space before/after punctuation is ordinary in Western texts (␣ means space).

Not always. Examples:

the Marines’ slogan—“semper fi”—is well known.
he uttered his usual greeting (“hello”).
‘“hello” is longer than “hi”,’ she noted.

Sep 30 '18 05:09 jgm

@jgm, what matters in this PR is the interaction between punctuation and emphasis. Are your examples affected? (I haven’t confirmed.)

Sep 30 '18 06:09 ikedas

My point was just that there might be unexpected consequences to treating these characters like non-punctuation, and that it isn't the case that they're never flanked by punctuation characters. It's hard to survey ahead of time all the cases that might arise, but here's one for concreteness:

He stammered, “*hello, I was...*”

If the double quotes get treated as non-punctuation for purposes of determining flankingness, then the final * is not right flanking and we don't get emphasis.

Sep 30 '18 06:09 jgm

If the double quotes get treated as non-punctuation for purposes of determining flankingness, then the final * is not right flanking and we don't get emphasis.

My PR does not treat LEFT/RIGHT DOUBLE QUOTATION MARK as non-punctuation; it treats them as EA punctuation. In fact, even with my modification applied:

$ build/src/cmark 
He stammered, “*hello, I was...*”
<p>He stammered, “<em>hello, I was...</em>”</p>
$

Sep 30 '18 07:09 ikedas

Sorry for the misunderstanding.

left_flanking = numdelims > 0 && !cmark_utf8proc_is_space(after_char) &&
                (!cmark_utf8proc_is_punctuation(after_char) ||
                 cmark_utf8proc_is_eastasian_punctuation(after_char) ||
                 cmark_utf8proc_is_space(before_char) ||
                 cmark_utf8proc_is_punctuation(before_char));
right_flanking = numdelims > 0 && !cmark_utf8proc_is_space(before_char) &&
                 (!cmark_utf8proc_is_punctuation(before_char) ||
                  cmark_utf8proc_is_eastasian_punctuation(before_char) ||
                  cmark_utf8proc_is_space(after_char) ||
                  cmark_utf8proc_is_punctuation(after_char));

Simplifying a bit (EDIT: sorry, first version was completely wrong):

Left flanking:

after char is non-space, AND
one of the following:
- after char is EA punctuation or non-punctuation
- before char is space or punctuation

Right flanking:

before char is non-space, AND
one of the following:
- before char is EA punctuation or non-punctuation
- after char is space or punctuation

The effect of this part of the rule is to make it strictly easier to count as left-flanking and right-flanking, in the cases where a left-flanking run is followed by EA punctuation or a right-flanking run is preceded by EA punctuation. So there won't be examples of the sort I was trying to give, where your rule fails to count something as left- or right-flanking that the original rule does.
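To make that comparison concrete, here is a minimal sketch that evaluates both versions of the left-flanking test over four coarse character classes. The `char_class` enum and the helper names are hypothetical stand-ins for cmark's Unicode predicates; only the boolean structure follows the snippet quoted above, and the `numdelims > 0` guard is left out.

```c
#include <stdbool.h>

/* Coarse, hypothetical character classes standing in for cmark's
 * Unicode predicates. EA punctuation also counts as punctuation. */
typedef enum { CLS_SPACE, CLS_WESTERN_PUNCT, CLS_EA_PUNCT, CLS_OTHER } char_class;

static bool is_space(char_class c) { return c == CLS_SPACE; }
static bool is_punct(char_class c) {
  return c == CLS_WESTERN_PUNCT || c == CLS_EA_PUNCT;
}
static bool is_ea_punct(char_class c) { return c == CLS_EA_PUNCT; }

/* Left-flanking test without the East Asian clause. */
static bool left_flanking_current(char_class before, char_class after) {
  return !is_space(after) &&
         (!is_punct(after) || is_space(before) || is_punct(before));
}

/* Left-flanking test with the extra East Asian clause. */
static bool left_flanking_proposed(char_class before, char_class after) {
  return !is_space(after) &&
         (!is_punct(after) || is_ea_punct(after) ||
          is_space(before) || is_punct(before));
}

/* Counts class pairs that the current test accepts but the proposed
 * test rejects. A result of 0 means the proposed test only adds cases. */
static int count_lost_left_flanking(void) {
  int lost = 0;
  for (int b = CLS_SPACE; b <= CLS_OTHER; b++) {
    for (int a = CLS_SPACE; a <= CLS_OTHER; a++) {
      if (left_flanking_current((char_class)b, (char_class)a) &&
          !left_flanking_proposed((char_class)b, (char_class)a)) {
        lost++;
      }
    }
  }
  return lost;
}
```

At this granularity the only pair that changes is (other character, EA punctuation), as in 猫は*「: it is not left-flanking under the current test and is left-flanking under the proposed one. The right-flanking test is the mirror image.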

Your rule may, however, count some delimiter runs as BOTH left and right flanking where the original rule only has one flankingness. To deal with that, you also modify the rules for "can open" and "can close". The current rule says that a delimiter run that is both left and right flanking can open emphasis when the before char is punctuation. Your rule loosens that up to: when the before char is punctuation or the after char is EA punctuation. This ensures that, in every case where your rule makes a formerly left and not-right flanking delimiter run both left and right flanking, if it could open/close emphasis before it will still be able to open/close emphasis.
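As a rough illustration of the can-open/can-close adjustment for `_`, the sketch below computes both flags for a single delimiter run with and without the East Asian clauses. The character classes and names are hypothetical, and the can-close change is assumed to mirror the can-open change described above; that symmetry is an assumption here, not something taken from the PR diff.

```c
#include <stdbool.h>

/* Hypothetical coarse classes; EA punctuation also counts as punctuation. */
typedef enum { CLS_SPACE, CLS_WESTERN_PUNCT, CLS_EA_PUNCT, CLS_OTHER } char_class;

typedef struct {
  bool can_open;
  bool can_close;
} delim_caps;

static bool is_space(char_class c) { return c == CLS_SPACE; }
static bool is_punct(char_class c) {
  return c == CLS_WESTERN_PUNCT || c == CLS_EA_PUNCT;
}

/* Flags for a run of '_' between `before` and `after`.
 * ea_rules == false: the rule as it stands today.
 * ea_rules == true:  the rule with the East Asian clauses added. */
static delim_caps underscore_caps(char_class before, char_class after,
                                  bool ea_rules) {
  bool ea_after = ea_rules && after == CLS_EA_PUNCT;
  bool ea_before = ea_rules && before == CLS_EA_PUNCT;

  bool left = !is_space(after) &&
              (!is_punct(after) || ea_after ||
               is_space(before) || is_punct(before));
  bool right = !is_space(before) &&
               (!is_punct(before) || ea_before ||
                is_space(after) || is_punct(after));

  delim_caps caps;
  /* Can open: before char is punctuation OR after char is EA punctuation. */
  caps.can_open = left && (!right || is_punct(before) || ea_after);
  /* Assumed mirror image of the can-open change. */
  caps.can_close = right && (!left || is_punct(after) || ea_before);
  return caps;
}
```

Under these assumptions the first `_` in 猫は_「のどか」_という。 (other character before, EA punctuation after) cannot open emphasis with `ea_rules == false` and can with `ea_rules == true`, while an intraword `_` between two ordinary characters still can neither open nor close.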

However, there could still be changes due to the fact that it could now close emphasis when it couldn't before. So, one kind of example to look for is a case where a delimiter run that formerly could only open emphasis can now both open and close, and gives bad results for that reason. I will think about whether there are realistic examples of this sort.

But, just to make a general comment, one thing I dislike about the proposed change is that it makes an already fairly complicated rule, which I could (barely) keep in my head, even more complicated and hard to think about. That is the reason I've found it difficult to get convinced that this change should be made. It's not by itself a reason to reject the change, but I haven't yet been convinced that the change won't have unanticipated consequences.

Oct 01 '18 17:10 jgm

Here's an (admittedly artificial) example where we'd see a difference, if I'm not mistaken:

*“*there*”*

With the proposed rule, the second * can close emphasis and so we'd get

<em>“</em>there<em>”</em>

whereas currently we get

<em>“<em>there</em>”</em>

Unless I've made a mistake in thinking about it...

Oct 01 '18 17:10 jgm

Another case:

*He said, **“*hello*”**.*

Oct 01 '18 17:10 jgm

I'll look into your simplified rule later (but I want to confirm: it is equivalent to my rule, isn't it?).

Your rule may, however, count some delimiter runs as BOTH left and right flanking where the original rule only has one flankingness. To deal with that, you also modify the rules for "can open" and "can close". The current rule says that a delimiter run that is both left and right flanking can open emphasis when the before char is punctuation. Your rule loosens that up to: when the before char is punctuation or the after char is EA punctuation. This ensures that, in every case where your rule makes a formerly left and not-right flanking delimiter run both left and right flanking, if it could open/close emphasis before it will still be able to open/close emphasis.

What is the reason for the "unique flankingness" requirement? To me, flankingness looks like it was introduced only to describe the parser's behavior (without considering the EA context).

However, there could still be changes due to the fact that it could now close emphasis when it couldn't before. So, one kind of example to look for is a case where a delimiter run that formerly could only open emphasis can now both open and close, and gives bad results for that reason. I will think about whether there are realistic examples of this sort.

Naturally, modifying the rules will change behavior. We have to modify the rules if they can't handle text the way we expect.

I can't decide whether the changes this brings to existing texts are acceptable. There seem to be these options:

Below, an "ambiguous punctuation" is a punctuation character that has the east_asian_width property "A" and can be used in both East Asian and Western contexts, including: ¡, ¿, –, —, ‘, ’, “, ”.

- Reject all the changes in this PR. --- Obviously uncomfortable for East Asian writers.
- Treat ambiguous punctuations as non-East Asian punctuation. --- A bit uncomfortable for East Asian writers.
- Add an option (at compile time or at run time) to treat ambiguous punctuations as either East Asian or non-East Asian punctuation, according to the user's choice.
- Treat ambiguous punctuations as East Asian punctuation. --- More or less uncomfortable for Western writers.
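The option-based variant could look roughly like the sketch below. The function name, the `ambiguous_as_ea` flag and both tables are hypothetical; the first table holds the eight ambiguous characters listed above, and the second is a deliberately tiny sample of East-Asian-only marks, not a complete classification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The eight "ambiguous" characters named above:
 * inverted ! and ?, en dash, em dash, and the curly quotes. */
static const uint32_t AMBIGUOUS_PUNCT[] = {
  0x00A1, 0x00BF, 0x2013, 0x2014, 0x2018, 0x2019, 0x201C, 0x201D
};

/* A few East-Asian-only marks, for illustration only:
 * U+3001, U+3002, U+300C, U+300D, U+FF0C. */
static const uint32_t EA_ONLY_PUNCT[] = {
  0x3001, 0x3002, 0x300C, 0x300D, 0xFF0C
};

static bool in_table(uint32_t cp, const uint32_t *table, size_t len) {
  for (size_t i = 0; i < len; i++) {
    if (table[i] == cp) {
      return true;
    }
  }
  return false;
}

/* Returns true if `cp` should be treated as East Asian punctuation.
 * `ambiguous_as_ea` is the hypothetical user option: when false, the
 * ambiguous characters stay in the Western punctuation class. */
static bool is_ea_punct(uint32_t cp, bool ambiguous_as_ea) {
  if (in_table(cp, EA_ONLY_PUNCT,
               sizeof EA_ONLY_PUNCT / sizeof EA_ONLY_PUNCT[0])) {
    return true;
  }
  return ambiguous_as_ea &&
         in_table(cp, AMBIGUOUS_PUNCT,
                  sizeof AMBIGUOUS_PUNCT / sizeof AMBIGUOUS_PUNCT[0]);
}
```

Hard-wiring the flag to false or to true gives the two "treat ambiguous punctuations as ..." options; exposing it to the user gives the option in between.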

Oct 02 '18 06:10 ikedas

Another case:

*He said, **“*hello*”**.*

I'll add corresponding examples with East Asian context.

For example,

*他說，**“*你好*”**。*

will be handled properly by current master (note: ， is not a comma plus a space but a single EA punctuation character), as:

<p><em>他說，<strong>“<em>你好</em>”</strong>。</em></p>

However, the example above is a lucky case. This sentence is probably still understandable without ，. Removing it,

*他說**“*你好*”**。*

will be rendered with current master as:

<p><em>他說</em>*“<em>你好</em>”**。*</p>

I think writers will find this result hard to accept.

As a workaround, we might recommend that writers mark it up like this, for example:

*他說 **“*你好*”**。*

This will be rendered as:

<p><em>他說 <strong>“<em>你好</em>”</strong>。</em></p>

The result is readable, if readers ignore the ugly space. However, it is hard to justify forcing writers to insert unusual spaces that would not appear in plain text without markup.

Note: My PR will not solve all the problems with current master: it cannot handle markup as complex in an East Asian context as in a Western one. In fact, since the example above is slightly complex, it will be rendered with my PR as:

<p><em>他說**“</em>你好<em>”**。</em></p>

However, from the point of view of East Asian writers, it improves considerably on the current behavior.

Oct 02 '18 07:10 ikedas

Yes, my simplified rephrasing was meant to be equivalent to your proposal. (Just to help me think about it more clearly.)

Thinking outside of the box a bit: instead of having two distinct classes of punctuation characters, would it work to treat East Asian characters in general (including both EA punctuation and EA non-punctuation characters) as equivalent to punctuation for determining flankingness and can-open/can-close?

That is: the rules would all be the same as they are, except that "punctuation" would be interpreted as including Western punctuation characters plus ALL EA characters. (Obviously, one might want a better name for this broad class than "punctuation," but that's a detail.)

This would keep the simpler logic of the current rules, and it would guarantee that nothing changes in the interpretation of Western texts.

Oct 02 '18 16:10 jgm

Just wondering, is there any progress on this?

All CJK projects based on CommonMark have been stuck on this for years.

Jan 18 '19 16:01 cangyuyao

Maybe this issue can be seen better from a different perspective. At least I have always found using the left-flanking and right-flanking terms confusing and I always easily got lost in them when thinking about some particular complicated input example.

Eventually I started to use an alternative wording in my head which (I believe) is 100% equivalent to the current spec's wording. It can be spelled out as follows:

Left score and right score of the delimiter run determine whether the run may or may not open/close an emphasis. The scores are computed as follows:

If the preceding character is Unicode whitespace, set the left score to 0. If the preceding character is Unicode punctuation, set the left score to 1. If the preceding character is anything else, set the left score to 2.

If the subsequent character is Unicode whitespace, set the right score to 0. If the subsequent character is Unicode punctuation, set the right score to 1. If the subsequent character is anything else, set the right score to 2.

If left score == 2 and right score == 2, and the delimiter run is _-based, then reset both scores to zero.

The delimiter run can open an emphasis iff left score <= right score and right score > 0. The delimiter run can close an emphasis iff left score >= right score and left score > 0.

(If you prefer code, MD4C uses this alternative wording internally.)
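For readers who want to experiment with the wording, here is a minimal sketch of the score computation exactly as spelled out above. It is not MD4C's code; the `char_kind` enum, the struct and the function names are hypothetical, and the classification of real characters into the three kinds is left to the caller.

```c
#include <stdbool.h>

/* Hypothetical three-way classification of the neighbouring character. */
typedef enum { CH_WHITESPACE, CH_PUNCTUATION, CH_OTHER } char_kind;

typedef struct {
  bool can_open;
  bool can_close;
} emphasis_caps;

/* Whitespace scores 0, punctuation 1, anything else 2. */
static int score_for(char_kind k) {
  switch (k) {
    case CH_WHITESPACE:
      return 0;
    case CH_PUNCTUATION:
      return 1;
    default:
      return 2;
  }
}

/* Applies the scoring rules to one delimiter run made of `delim`
 * ('*' or '_') with the given neighbours. */
static emphasis_caps score_delimiter_run(char_kind preceding,
                                         char_kind subsequent, char delim) {
  int left = score_for(preceding);
  int right = score_for(subsequent);

  /* Intraword '_' never opens or closes emphasis. */
  if (left == 2 && right == 2 && delim == '_') {
    left = 0;
    right = 0;
  }

  emphasis_caps caps;
  caps.can_open = left <= right && right > 0;
  caps.can_close = left >= right && left > 0;
  return caps;
}
```

With these rules a `*` that follows an ordinary character and precedes punctuation, as in 猫は*「, gets scores 2 and 1, so it can close but cannot open. That is the case the East Asian proposals try to relax.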

I post this because it might be easier to come up with a solution in this wording, by just adding more rules to the score calculations above. IMHO, it could perhaps even solve the issue with the ambiguous punctuation noted in earlier comments. E.g. something like:

If the preceding character is EA punctuation and the subsequent character is any EA character, then reset the right score to zero. (I.e. the run is treated as if it had punctuation before it and whitespace after it in the current implementation.)

If the subsequent character is EA punctuation and the preceding character is any EA character, then reset the left score to zero. (I.e. the run is treated as if it had punctuation after it and whitespace before it in the current implementation.)

At least, it can be easily seen this wouldn't change anything for western text, and the people who (unlike me) understand EA languages and their needs may play more safely as long as they propose rules which require EA-characters on both sides of the run. Divide et impera.

May 30 '19 19:05 mity

Although this PR works for Japanese and Chinese text (please note that Korean text uses "Western" punctuation marks), it does not solve a related but slightly different issue in Korean text reported here (github/javascript-tutorial, #2040).

Koreans expect *스크립트(script)*라고 to be rendered to <em>스크립트(script)</em>라고. Since Korean text uses "Western" punctuation marks, the current CommonMark spec or this PR does not render the above Korean text "correctly."

This Korean-text issue may be resolved by adding one more condition to @jgm's simple rule in this comment:

Right flanking:

before char is non-space, AND
one of the following:
- before char is EA punctuation or non-punctuation
- after char is space or punctuation or any EA character,

although it will break nested emphases more severely.
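A minimal sketch of that extended right-flanking test, applied to the closing `*` in *스크립트(script)*라고, might look as follows. All classifier functions here are rough, hypothetical approximations (a few code point ranges only) and are not cmark's predicates; the `korean_clause` flag switches the extra "any EA character" condition on and off.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rough stand-ins for real Unicode classification (assumptions). */
static bool is_space(uint32_t cp) {
  return cp == ' ' || cp == '\t' || cp == '\n';
}
static bool is_ascii_punct(uint32_t cp) {
  return (cp >= 0x21 && cp <= 0x2F) || (cp >= 0x3A && cp <= 0x40) ||
         (cp >= 0x5B && cp <= 0x60) || (cp >= 0x7B && cp <= 0x7E);
}
static bool is_ea_punct(uint32_t cp) {
  /* U+3001, U+3002 and the bracket pairs U+3008..U+3011. */
  return cp == 0x3001 || cp == 0x3002 || (cp >= 0x3008 && cp <= 0x3011);
}
static bool is_punct(uint32_t cp) {
  return is_ascii_punct(cp) || is_ea_punct(cp);
}
static bool is_ea_char(uint32_t cp) {
  return (cp >= 0x3040 && cp <= 0x30FF) || /* Hiragana, Katakana */
         (cp >= 0x4E00 && cp <= 0x9FFF) || /* CJK Unified Ideographs */
         (cp >= 0xAC00 && cp <= 0xD7A3);   /* Hangul Syllables */
}

/* Right-flanking test in the simplified form quoted above.
 * korean_clause adds "after char is any EA character". */
static bool right_flanking(uint32_t before, uint32_t after,
                           bool korean_clause) {
  return !is_space(before) &&
         (is_ea_punct(before) || !is_punct(before) ||
          is_space(after) || is_punct(after) ||
          (korean_clause && is_ea_char(after)));
}
```

For the closing `*`, the before char is `)` (U+0029) and the after char is 라 (U+B77C). Without the extra clause the run is not right-flanking, so it cannot close emphasis; with the clause it is.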

By the way, I think a better way to solve CJK-related emphasis issues is to introduce a new syntax ~_, _~, ~*, and *~ originally suggested by Prof. John MacFarlane for intra-word emphasis. However, his suggestion is equally applicable to any CJK-related emphasis issues arising from the lack of whitespace.

Aug 21 '20 04:08 spencer246

It seems that the issue on emphasizing Korean texts has not been reported before.

I posted this issue in https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491 as a comment.

Aug 21 '20 04:08 spencer246

Sorry I haven't had time to think over this issue. I have one idea I would like to try and will post it later (perhaps in months).

Jul 17 '22 07:07 ikedas
