support and test UTF-8 output

Open avsm opened this issue 5 years ago • 9 comments

From @nojb:

UTF-8 support is missing.

See also previous discussion in #27

(this is tracking blockers to a new omd release from #174)

avsm avatar Apr 19 '19 13:04 avsm

@clecat I believe having UTF-8 is also important for mdx integration, since even English books often contain Unicode symbols for different reasons, e.g. mathematical or computer science symbols.

XVilka avatar Aug 19 '19 08:08 XVilka

Unicode symbols can appear in English in ordinary prose text. A simple example is fancy quotes. UTF-8 support is critical for practically anything outside source code, configuration files, and other computer files.

aantron avatar Apr 23 '20 09:04 aantron

Currently, omd is byte-based so it will take (and spit back out) arbitrary byte sequences, including UTF-8.

nojb avatar Apr 23 '20 09:04 nojb

"Proper" UTF-8 support would enforce the encoding and also take that into account when case or other normalization is needed.
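
To illustrate the distinction, here is a minimal OCaml sketch (not omd's actual code) of what byte-based handling implies. It uses only standard library functions. The strings are the Greek labels from the CommonMark spec, and the file is assumed to be saved as UTF-8.

```ocaml
(* Byte-based handling: UTF-8 bytes pass through untouched, but
   ASCII-only case conversion cannot fold non-ASCII letters. *)
let upper = "ΑΓΩ" (* Greek capitals, two bytes each in UTF-8 *)
let lower = "αγω"

let () =
  (* String.length counts bytes, not characters. *)
  assert (String.length upper = 6);
  (* ASCII lowercasing leaves every byte >= 0x80 unchanged. *)
  assert (String.lowercase_ascii upper = upper);
  (* So the two strings never compare equal after ASCII folding. *)
  assert (String.lowercase_ascii upper <> String.lowercase_ascii lower);
  (* Pure ASCII text folds as expected. *)
  assert (String.lowercase_ascii "FOO" = "foo")
```

Round-tripping works because nothing inspects the bytes. Case normalization is where an encoding-aware step would be needed.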

nojb avatar Apr 23 '20 09:04 nojb

Ok, yes, thanks. I was misled by other tools and the title of this issue, but omd indeed wasn't the problem.

Perhaps this issue should be renamed to only "test...", or otherwise clarified.

aantron avatar Apr 23 '20 09:04 aantron

And similarly for #27.

aantron avatar Apr 23 '20 09:04 aantron

Strictly speaking, this is not a blocker. The current byte-based approach is good enough for the vast majority of cases. I'll see if we can easily plug in a UTF-8 decoder for the 2.0 release, but if not, full Unicode support will have to wait until after 2.0.

nojb avatar Jun 20 '20 14:06 nojb

Spec 175, which is currently disabled, fails due to inaccurate handling of upper/lower case conversion in Unicode link reference labels. So proper support for this issue is a blocker for #235, which is a blocker for the 2.0 release milestone.

Failing verification tests:

diff --git a/tests/spec-175.html b/tests/spec-175.html.new
index 93c6540..97ea23b 100644
--- a/tests/spec-175.html
+++ b/tests/spec-175.html.new
@@ -1 +1,2 @@
-<p><a href="/%CF%86%CE%BF%CF%85">αγω</a></p>
+<p>[ΑΓΩ]: /φου</p>
+<p>[αγω]</p>
         git (internal) (exit 1)
(cd _build/default && /usr/bin/git --no-pager diff --no-index --color=always -u tests/spec-536.html tests/spec-536.html.new)
diff --git a/tests/spec-536.html b/tests/spec-536.html.new
index afe4557..8f8663f 100644
--- a/tests/spec-536.html
+++ b/tests/spec-536.html.new
@@ -1 +1 @@
-<p><a href="/url">Толпой</a> is a Russian word.</p>
+<p>[Толпой][Толпой] is a Russian word.</p>

These correspond to the following spec examples:

[ΑΓΩ]: /φου

[αγω]
.
<p><a href="/%CF%86%CE%BF%CF%85">αγω</a></p>

https://github.com/madroach/omd/blob/master/tests/spec.txt#L2985-L2991

and

[Толпой][Толпой] is a Russian word.

[ТОЛПОЙ]: /url
.
<p><a href="/url">Толпой</a> is a Russian word.</p>

https://github.com/madroach/omd/blob/master/tests/spec.txt#L8072-L8078
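
Both failures come down to label matching: the CommonMark spec normalizes link labels with a Unicode case fold, which ASCII-only lowercasing cannot provide. Below is a rough OCaml sketch of the idea, not omd's implementation. `fold_label` is a hypothetical helper that handles only ASCII, the Greek capitals U+0391..U+03A9, and the basic Cyrillic capitals U+0410..U+042F. A real fix would need full Unicode case-folding data and a complete UTF-8 decoder.

```ocaml
(* Hypothetical helper: lowercase ASCII, Greek capitals and basic Cyrillic
   capitals in a UTF-8 string. Only well-formed 2-byte sequences are
   decoded; every other byte is copied through (ASCII letters lowercased). *)
let fold_label (s : string) : string =
  let n = String.length s in
  let b = Buffer.create n in
  let rec go i =
    if i < n then begin
      let c0 = Char.code s.[i] in
      if c0 >= 0xC2 && c0 <= 0xDF && i + 1 < n
         && Char.code s.[i + 1] land 0xC0 = 0x80
      then begin
        (* Decode a 2-byte sequence into a code point in 0x80..0x7FF. *)
        let cp = ((c0 land 0x1F) lsl 6) lor (Char.code s.[i + 1] land 0x3F) in
        let folded =
          if (cp >= 0x0391 && cp <= 0x03A9 && cp <> 0x03A2)
             || (cp >= 0x0410 && cp <= 0x042F)
          then cp + 0x20 (* in these ranges, lowercase = uppercase + 0x20 *)
          else cp
        in
        Buffer.add_utf_8_uchar b (Uchar.of_int folded);
        go (i + 2)
      end else begin
        Buffer.add_char b (Char.lowercase_ascii s.[i]);
        go (i + 1)
      end
    end
  in
  go 0;
  Buffer.contents b

let () =
  (* The label pairs from spec examples 175 and 536 now match. *)
  assert (fold_label "ΑΓΩ" = fold_label "αγω");
  assert (fold_label "ТОЛПОЙ" = fold_label "Толпой")
```

The fixed offset of 0x20 holds only for the ranges listed. General case folding is table-driven and can change the length of the string, so this is an illustration of why the two examples fail, not a drop-in fix.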

shonfeder avatar May 29 '21 16:05 shonfeder

Reclassified as a bug, since this is causing us to fail verification against the spec.

shonfeder avatar May 29 '21 16:05 shonfeder