marked
Syntax parse fails with Japanese punctuation (`、`), strong syntax and code syntax
Marked version:
- v4.0.18
Describe the bug
Copied from https://github.com/volca/markdown-preview/issues/135.
In the case below, marked does not parse the syntax correctly:
```
% cat test.md
* ×: あれ、**`foo`これ**、それ
* ○: あれ、 **`foo`これ**、それ
* ×: あれ、**`foo`これ** 、それ
* ○: あれ、**fooこれ**、それ
* ○: あれ、 **fooこれ**、それ
* ○: あれ、**fooこれ** 、それ
% npx marked --version
4.0.18
% npx marked < test.md
<ul>
<li><p>×: あれ、**<code>foo</code>これ**、それ</p>
</li>
<li><p>○: あれ、 <strong><code>foo</code>これ</strong>、それ</p>
</li>
<li><p>×: あれ、**<code>foo</code>これ** 、それ</p>
</li>
<li><p>○: あれ、<strong>fooこれ</strong>、それ</p>
</li>
<li><p>○: あれ、 <strong>fooこれ</strong>、それ</p>
</li>
<li><p>○: あれ、<strong>fooこれ</strong> 、それ</p>
</li>
</ul>
```
With Japanese punctuation (`、`), strong syntax (`**`), and code syntax (`` ` ``), an extra space is needed for the strong span to be parsed correctly (the first 3 examples).
Without code syntax, however, no extra space is required (the last 3 examples).
So, isn't this a syntax parsing problem with CJK symbol characters?
To Reproduce
As above.
Expected behavior
Parse the syntax correctly, as Pandoc does:
```
% pandoc --version
pandoc.exe 2.18
Compiled with pandoc-types 1.22.2, texmath 0.12.5, skylighting 0.12.3,
citeproc 0.7, ipynb 0.2, hslua 2.2.0
Scripting engine: Lua 5.4
User data directory: C:\Users\yasuda\AppData\Roaming\pandoc
Copyright (C) 2006-2022 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
% pandoc < test.md
<ul>
<li><p>×: あれ、<strong><code>foo</code>これ</strong>、それ</p></li>
<li><p>○: あれ、 <strong><code>foo</code>これ</strong>、それ</p></li>
<li><p>×: あれ、<strong><code>foo</code>これ</strong> 、それ</p></li>
<li><p>○: あれ、<strong>fooこれ</strong>、それ</p></li>
<li><p>○: あれ、 <strong>fooこれ</strong>、それ</p></li>
<li><p>○: あれ、<strong>fooこれ</strong> 、それ</p></li>
</ul>
```
Looks like the issue is that `、` is not included as punctuation for the left delimiter.
According to the spec, punctuation should include:
an ASCII punctuation character or anything in the general Unicode categories `Pc`, `Pd`, `Pe`, `Pf`, `Pi`, `Po`, or `Ps`.
So, currently you support only ASCII punctuation, right?
The character `、` (`U+3001`, Ideographic Comma) is in the Unicode `Po` category, so it is a 'Unicode punctuation character'. Could you support such Unicode punctuation?
Also, the Ideographic Space (`U+3000`) is a 'Unicode whitespace character' in the `Zs` category. I think it should also be supported as a space character besides space (`U+0020`) and tab (`U+0009`), if it isn't already.
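For reference, the category membership claimed above can be checked with JavaScript Unicode property escapes (ES2018+). The helper names `isUnicodePunctuation` and `isUnicodeWhitespace` are illustrative, not part of marked:

```javascript
// \p{P} matches any Unicode punctuation (Pc, Pd, Pe, Pf, Pi, Po, Ps);
// \p{Zs} matches space separators such as the Ideographic Space.
// Both require the `u` flag on the regex.
const isUnicodePunctuation = (ch) => /^\p{P}$/u.test(ch);
const isUnicodeWhitespace = (ch) => /^(\p{Zs}|\t)$/u.test(ch);

console.log(isUnicodePunctuation('、'));    // U+3001 Ideographic Comma → true
console.log(isUnicodePunctuation('a'));     // → false
console.log(isUnicodeWhitespace('\u3000')); // U+3000 Ideographic Space → true
console.log(isUnicodeWhitespace(' '));      // U+0020 → true
```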
Hi @UziTech, can I work on this too? This one looks interesting 😀. I might need to add some tests for Japanese and Chinese text too.
@azmy60 ya you can take any that you think you can help with
There is an exhaustive collection of UTF-8 punctuation in CommonMark. Do you think we should add all of it, @UziTech? I'm not really sure how to write the tests, though. Adding the Ideographic Comma (as @KSR-Yasuda suggested) to the punctuation list works just fine with his example.
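A minimal sketch of the difference that change makes (the character classes below are illustrative, not marked's actual rules):

```javascript
// ASCII-only punctuation class (U+0021–U+002F, U+003A–U+0040,
// U+005B–U+0060, U+007B–U+007E), similar to an ASCII-only delimiter rule.
const asciiPunct = /[!-\/:-@\[-`{-~]/;
// The same class extended with a few CJK punctuation characters
// (、 。 ！ ，), all in the Unicode Po category.
const cjkPunct = /[!-\/:-@\[-`{-~\u3001\u3002\uFF01\uFF0C]/;

console.log(asciiPunct.test('、')); // false: Ideographic Comma not matched
console.log(cjkPunct.test('、'));   // true
```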
[UPDATE]
There is a Stack Overflow answer listing the punctuation codes. It only goes up to 4 hex digits, since JavaScript `\uXXXX` escapes only support code points up to `\uFFFF`.
Apparently, adding the rest of the Unicode punctuation also fixes #2041, since it includes `\uFF01`.
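To illustrate the `\uFFFF` limitation mentioned above: a `\uXXXX` escape can only name code points in the Basic Multilingual Plane; astral code points need surrogate pairs or the `\u{...}` syntax with the `u` flag:

```javascript
// U+3001 (、) and U+FF01 (！) are both BMP characters,
// so plain \uXXXX escapes are enough for them.
const bmpPunct = /[\u3001\uFF01]/;
console.log(bmpPunct.test('あれ、それ')); // true

// An astral character like U+1F600 occupies two UTF-16 code units...
console.log('\u{1F600}'.length); // 2
// ...and needs \u{...} plus the `u` flag to match as one code point.
console.log(/^\u{1F600}$/u.test('😀')); // true
```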