omd
support and test UTF-8 output
From @nojb:
UTF-8 support is missing.
See also previous discussion in #27
(this is tracking blockers to a new omd release from #174)
@clecat I believe UTF-8 support is also important for mdx integration, since even English books often contain Unicode symbols, e.g. mathematical or computer science symbols.
Unicode symbols can appear in English in ordinary prose text. A simple example is fancy quotes. UTF-8 support is critical for practically anything outside source code, configuration files, and other computer files.
Currently, omd is byte-based, so it will take (and spit back out) arbitrary byte sequences, including UTF-8.
"Proper" UTF-8 support would enforce the encoding and also take that into account when case or other normalization is needed.
Ok, yes, thanks. I was misled by other tools and the title of this issue, but omd indeed wasn't the problem.
Perhaps this issue should be renamed to only "test...", or otherwise clarified.
And similarly for #27.
Strictly speaking, this is not a blocker. The current byte-based approach is good enough for the vast majority of cases. I'll see if we can plug in a UTF-8 decoder easily for the 2.0 release, but if not, full Unicode support will have to wait until after 2.0.
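To make the difference between byte pass-through and an enforced encoding concrete, here is a minimal sketch in Python (not omd's OCaml code, and not a proposal for its API). The function names `decode_strict` and `decode_lenient` are hypothetical; they only illustrate the two policies a decoder could apply to malformed input.

```python
def decode_strict(data: bytes) -> str:
    """Return the text if data is well-formed UTF-8, otherwise raise ValueError."""
    try:
        # errors="strict" is the default: malformed sequences raise UnicodeDecodeError.
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 at byte offset {exc.start}") from exc


def decode_lenient(data: bytes) -> str:
    """Replace each malformed sequence with U+FFFD instead of failing."""
    return data.decode("utf-8", errors="replace")
```

A byte-based parser corresponds to neither function: it never decodes at all, which is why valid UTF-8 survives a round trip but malformed input is also passed through unchanged.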
Spec 175, which is currently disabled, fails due to inaccurate handling of upper/lower case conversion on Unicode URLs. So proper support for this issue is a blocker for #235, which is a blocker for the 2.0 release milestone.
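As an illustration of why case conversion matters here, the following Python sketch compares an ASCII-only label normalization with the Unicode case fold that the CommonMark spec asks for when matching link labels. Both function names are hypothetical, and the ASCII-only variant is an assumption about what a byte-oriented implementation might do, not a transcription of omd's code.

```python
import re


def normalize_label_ascii(label: str) -> str:
    """Collapse whitespace and lowercase only the ASCII letters A-Z."""
    collapsed = re.sub(r"\s+", " ", label.strip())
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in collapsed)


def normalize_label_unicode(label: str) -> str:
    """Collapse whitespace and apply the Unicode case fold."""
    collapsed = re.sub(r"\s+", " ", label.strip())
    return collapsed.casefold()


# The labels from the two failing spec examples.
for definition, reference in [("ΑΓΩ", "αγω"), ("ТОЛПОЙ", "Толпой")]:
    ascii_match = normalize_label_ascii(definition) == normalize_label_ascii(reference)
    unicode_match = normalize_label_unicode(definition) == normalize_label_unicode(reference)
    print(definition, reference, ascii_match, unicode_match)
```

With the ASCII-only variant neither pair matches, so the reference is left as literal text; with the case fold both pairs match.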
Failing verification tests:
diff --git a/tests/spec-175.html b/tests/spec-175.html.new
index 93c6540..97ea23b 100644
--- a/tests/spec-175.html
+++ b/tests/spec-175.html.new
@@ -1 +1,2 @@
-<p><a href="/%CF%86%CE%BF%CF%85">αγω</a></p>
+<p>[ΑΓΩ]: /φου</p>
+<p>[αγω]</p>
git (internal) (exit 1)
(cd _build/default && /usr/bin/git --no-pager diff --no-index --color=always -u tests/spec-536.html tests/spec-536.html.new)
diff --git a/tests/spec-536.html b/tests/spec-536.html.new
index afe4557..8f8663f 100644
--- a/tests/spec-536.html
+++ b/tests/spec-536.html.new
@@ -1 +1 @@
-<p><a href="/url">Толпой</a> is a Russian word.</p>
+<p>[Толпой][Толпой] is a Russian word.</p>
From the spec examples:
[ΑΓΩ]: /φου
[αγω]
.
<p><a href="/%CF%86%CE%BF%CF%85">αγω</a></p>
https://github.com/madroach/omd/blob/master/tests/spec.txt#L2985-L2991
and
[Толпой][Толпой] is a Russian word.
[ТОЛПОЙ]: /url
.
<p><a href="/url">Толпой</a> is a Russian word.</p>
https://github.com/madroach/omd/blob/master/tests/spec.txt#L8072-L8078
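The expected output of the first example also percent-encodes the destination `/φου` as its UTF-8 bytes. A small Python sketch of that step, using the standard library's `urllib.parse.quote`, is shown below; the wrapper name `encode_destination` is hypothetical, and keeping only `/` unescaped is a simplification rather than the full set of characters a Markdown renderer would preserve.

```python
from urllib.parse import quote


def encode_destination(dest: str) -> str:
    """Percent-encode non-ASCII characters as their UTF-8 bytes, keeping '/'."""
    return quote(dest, safe="/")


print(encode_destination("/φου"))  # /%CF%86%CE%BF%CF%85
print(encode_destination("/url"))  # /url
```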
Reclassified as a bug, since this is causing us to fail verification against the spec.