
Syntax parse fails with Japanese punctuation (`、`), strong syntax and code syntax

Open KSR-Yasuda opened this issue 2 years ago • 3 comments

Marked version:

  • v4.0.18

Describe the bug

Copied from https://github.com/volca/markdown-preview/issues/135.

In the case below, the syntax is not parsed correctly.

% cat test.md
* ×: あれ、**`foo`これ**、それ
* ○: あれ、 **`foo`これ**、それ
* ×: あれ、**`foo`これ** 、それ

* ○: あれ、**fooこれ**、それ
* ○: あれ、 **fooこれ**、それ
* ○: あれ、**fooこれ** 、それ

% npx marked --version
4.0.18

% npx marked < test.md
<ul>
<li><p>×: あれ、**<code>foo</code>これ**、それ</p>
</li>
<li><p>○: あれ、 <strong><code>foo</code>これ</strong>、それ</p>
</li>
<li><p>×: あれ、**<code>foo</code>これ** 、それ</p>
</li>
<li><p>○: あれ、<strong>fooこれ</strong>、それ</p>
</li>
<li><p>○: あれ、 <strong>fooこれ</strong>、それ</p>
</li>
<li><p>○: あれ、<strong>fooこれ</strong> 、それ</p>
</li>
</ul>

When Japanese punctuation (`、`), strong syntax (**), and code syntax (`) are combined, an extra space is needed for them to be parsed correctly (the first three examples).

Without code syntax, however, no extra space is required (the last three examples).

So it isn't a syntax parsing problem with CJK symbol characters?

To Reproduce

As above.

Expected behavior

Parse the syntax correctly, as Pandoc does.

% pandoc --version
pandoc.exe 2.18
Compiled with pandoc-types 1.22.2, texmath 0.12.5, skylighting 0.12.3,
citeproc 0.7, ipynb 0.2, hslua 2.2.0
Scripting engine: Lua 5.4
User data directory: C:\Users\yasuda\AppData\Roaming\pandoc
Copyright (C) 2006-2022 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

% pandoc < test.md
<ul>
<li><p>×: あれ、<strong><code>foo</code>これ</strong>、それ</p></li>
<li><p>○: あれ、 <strong><code>foo</code>これ</strong>、それ</p></li>
<li><p>×: あれ、<strong><code>foo</code>これ</strong> 、それ</p></li>
<li><p>○: あれ、<strong>fooこれ</strong>、それ</p></li>
<li><p>○: あれ、 <strong>fooこれ</strong>、それ</p></li>
<li><p>○: あれ、<strong>fooこれ</strong> 、それ</p></li>
</ul>

KSR-Yasuda • Jul 12 '22 00:07

Looks like the issue is that `、` is not included as punctuation for the left delimiter.

According to the spec, the punctuation should include:

an ASCII punctuation character or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.
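
The effect of that definition on the reported case can be sketched as follows. This is a minimal illustration of the CommonMark left-flanking rule, not marked's actual code: `isLeftFlanking` and the regex names are hypothetical, and the sketch assumes a runtime with Unicode property escapes (`\p{...}` with the `u` flag).

```javascript
// Sketch only: hypothetical helpers, not marked's implementation.
// ASCII punctuation as listed in the CommonMark spec.
const ASCII_PUNCT = /[!-\/:-@\[-`{-~]/;
// ASCII punctuation plus the Unicode P* categories (Pc, Pd, Pe, Pf, Pi, Po, Ps).
const UNICODE_PUNCT = /[!-\/:-@\[-`{-~\p{P}]/u;
// Unicode whitespace per the spec: Zs, tab, line feed, form feed, carriage return.
const UNICODE_SPACE = /[\p{Zs}\t\n\f\r]/u;

// prev/next are the characters around a delimiter run ('' at a line boundary).
function isLeftFlanking(prev, next, punct) {
  // The start and end of a line count as whitespace.
  const isSpace = (ch) => ch === '' || UNICODE_SPACE.test(ch);
  if (isSpace(next)) return false;
  // Rule 2a: not followed by punctuation.
  if (!punct.test(next)) return true;
  // Rule 2b: followed by punctuation, so it must be preceded by
  // whitespace or punctuation.
  return isSpace(prev) || punct.test(prev);
}

// Opening ** in "あれ、**`foo`これ**": preceded by "、", followed by "`".
const asciiOnly = isLeftFlanking('、', '`', ASCII_PUNCT);
const unicode = isLeftFlanking('、', '`', UNICODE_PUNCT);
// Without the code span the next character is "f", so rule 2a applies.
const noCode = isLeftFlanking('、', 'f', ASCII_PUNCT);
// A space before the run satisfies rule 2b even with ASCII-only punctuation.
const spaced = isLeftFlanking(' ', '`', ASCII_PUNCT);

console.log({ asciiOnly, unicode, noCode, spaced });
// { asciiOnly: false, unicode: true, noCode: true, spaced: true }
```

With an ASCII-only punctuation class, only the variant with the code span and no leading space fails, which matches the × and ○ rows in the report.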

UziTech • Jul 12 '22 16:07

So, right now only ASCII punctuation is supported, right?

The character `、` (U+3001, Ideographic Comma) is in the Unicode Po category, so it is one of the 'Unicode punctuation characters'.

Could you support such Unicode punctuation?

KSR-Yasuda • Jul 13 '22 00:07

Also, U+3000 (Ideographic Space) is a 'Unicode whitespace character', since it is in the Zs category.

I think it should also be supported as a space character alongside space (U+0020) and tab (U+0009), if it is not already.
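
A minimal sketch of such a check, following the CommonMark definition of a Unicode whitespace character (anything in the Zs category, plus tab, line feed, form feed, and carriage return). The helper name `isUnicodeWhitespace` is hypothetical and this is not how marked currently tests for whitespace; it assumes a runtime with Unicode property escapes.

```javascript
// Sketch only: hypothetical helper, not marked's implementation.
// Zs covers U+0020 (space) and U+3000 (Ideographic Space), among others.
const UNICODE_WHITESPACE = /[\p{Zs}\t\n\f\r]/u;

function isUnicodeWhitespace(ch) {
  return UNICODE_WHITESPACE.test(ch);
}

const checks = {
  ideographicSpace: isUnicodeWhitespace('\u3000'),
  asciiSpace: isUnicodeWhitespace(' '),
  tab: isUnicodeWhitespace('\t'),
  ideographicComma: isUnicodeWhitespace('、'),
};

console.log(checks);
// { ideographicSpace: true, asciiSpace: true, tab: true, ideographicComma: false }
```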

KSR-Yasuda • Jul 13 '22 00:07

Hi @UziTech, can I work on this too? This one looks interesting 😀. I might need to add some tests for Japanese and Chinese text too.

azmy60 • May 19 '23 02:05

@azmy60 ya you can take any that you think you can help with

UziTech • May 19 '23 02:05

There is an exhaustive collection of Unicode punctuation in CommonMark. Do you think we should add all of it, @UziTech? I'm not really sure how to write the tests, though. Adding the Ideographic Comma (as @KSR-Yasuda suggested) to the punctuation list works just fine with the example above.

[UPDATE] There is a Stack Overflow answer listing the punctuation codes. It only goes up to 4 hex digits, since the `\uXXXX` escape in JavaScript only reaches \uFFFF.
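
As a possible alternative to a hand-written list of 4-hex-digit codes, a regex with the `u` flag can use the Unicode property escape `\p{P}`, which also covers punctuation above U+FFFF. This is only a sketch, assuming a runtime that supports property escapes (ES2018 or later); whether marked can rely on that is not something this thread establishes.

```javascript
// Sketch only: \p{P} matches every character in a Unicode P* category.
const ANY_PUNCT = /\p{P}/u;

const samples = [
  '、',         // U+3001 IDEOGRAPHIC COMMA (Po)
  '！',         // U+FF01 FULLWIDTH EXCLAMATION MARK (Po)
  '\u{10100}',  // AEGEAN WORD SEPARATOR LINE (Po), above U+FFFF
  'あ',         // U+3042 HIRAGANA LETTER A (a letter, not punctuation)
];

const results = samples.map((ch) => ANY_PUNCT.test(ch));

console.log(results);
// [ true, true, true, false ]
```

Note that `\p{P}` does not include ASCII symbols such as `$`, `+`, or `~`, which the spec lists as ASCII punctuation, so those would still need to be listed explicitly.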

Apparently, adding the rest of the Unicode punctuation also fixes #2041, because it includes \uFF01.

azmy60 • May 20 '23 06:05