djot
Support entities (anyway, please!)
I know it's said that HTML-style entities are not supported because djot is not to favor any target format, but I wonder if it wouldn't be a good idea to have a mechanism for including characters which are hard to type. Entities are a well-known syntax for that, which I would say is good enough.[^1] I can share a Lua table mapping HTML 5 entity names to UTF-8 characters, but supporting only numeric entities would be a reasonable limitation, since djot would only borrow the syntax. Those can be handled very effectively in Lua, e.g.:
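    str:gsub('(%&(%#?%w%w-)%;)', function (entity, id)
        if id:match('^#') then
          -- "#331" becomes "0331" and "#x14b" becomes "0x14b"; tonumber parses both.
          -- The extra parentheses drop gsub's second return value (the match
          -- count), which tonumber would otherwise take as a numeric base.
          local cp = tonumber((id:gsub('^#', '0')))
          if cp and cp >= 0 and cp <= 0x10ffff then
            return char(cp)
          end
        end
        error("Unsupported or invalid entity: " .. entity, 2)
      end
    )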
where char can be either utf8.char or this:
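    function char(a)
      -- Report the type of the argument actually passed in (the original
      -- message referenced cp before it was defined).
      local cp = math.floor(assert(tonumber(a), "Expected number but got " .. type(a)))
      if cp < 0 or cp > 0x10ffff then
        error("Codepoint is out of range: " .. a)
      end
      -- ASCII is encoded as a single byte.
      if cp < 128 then
        return string.char(cp)
      end
      local s = ""
      -- Largest value (exclusive) that fits in the lead byte of a 2-byte sequence.
      local prefix_max = 32
      while true do
        -- Emit the low six bits as a continuation byte (10xxxxxx).
        local suffix = cp % 64
        s = string.char(128 + suffix) .. s
        cp = (cp - suffix) / 64
        if cp < prefix_max then
          -- The remaining bits fit in the lead byte (110xxxxx, 1110xxxx, 11110xxx).
          return string.char((256 - (2 * prefix_max)) + cp) .. s
        end
        prefix_max = prefix_max / 2
      end
    end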
[^1]: I would prefer a paired delimiter. My string interpolation DSL uses @(...) where the parentheses may contain one or more of (1) a decimal code point like 331, (2) a hex codepoint like 0x14b, (3) an entity name like eng, or (4) a Unicode name in angle brackets like <Latin small letter eng> (in the Perl implementation).
i think supporting inserting characters by codepoint is a good thing—especially with invisible or confusable characters it can be useful. i think HTML entity names are not so good; many of them are essentially legacy and the coverage is not necessarily complete or well‐thought‐out.
i don’t like the XML/HTML entity reference syntax because it makes the decimal form of codepoints, &#NNNN;, easier to type than the hexadecimal form, &#xNNNN;. hexadecimal makes much more sense for unicode and i’m not sure that it even makes sense for decimal codepoints to be supported.
why not extend the emoji syntax to allow arbitrary characters by unicode codepoint, like :U+2764:? perhaps even multiple characters could be included, such as :U+2764.FE0E: ("." is commonly used in unicode documentation for delimiting sequences of codepoints). emoji are already a kind of entity reference, after all.
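To make the proposal concrete, here is a minimal sketch of how such references could be expanded in Lua 5.3 or later. The function name `expand_codepoints` is hypothetical, the code works on plain strings rather than on any actual djot parser structure, and it leaves anything it cannot decode untouched:

```lua
-- Hypothetical helper (not part of djot): expands :U+XXXX: and
-- dot-separated sequences such as :U+XXXX.YYYY: into UTF-8 text.
-- Assumes Lua 5.3+ for utf8.char.
local function expand_codepoints(text)
  -- The outer parentheses discard gsub's second result (the match count).
  return (text:gsub(':U%+([%x.]+):', function(seq)
    local chars = {}
    for hex in seq:gmatch('[^.]+') do
      local cp = tonumber(hex, 16)
      if not cp or cp < 0 or cp > 0x10FFFF then
        return nil -- returning nil keeps the original text
      end
      chars[#chars + 1] = utf8.char(cp)
    end
    if #chars == 0 then
      return nil -- only dots, no codepoints: keep the original text
    end
    return table.concat(chars)
  end))
end
```

Called on a string containing `:U+2764.FE0E:`, this would yield the heart character followed by the text-presentation variation selector, while an out-of-range reference such as `:U+110000:` would pass through unchanged.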
As there are escapes already, why not add Unicode escapes as supported in many programming languages, along the lines of \u1234? If so, preferably Lua 5.3 style with braces, \u{123}, so that one need only type as many digits as necessary.
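As a rough illustration of the braced form, the following sketch expands such escapes in a plain string. The name `expand_unicode_escapes` is made up for this example, and the code deliberately ignores how a real djot parser would interact with its existing backslash escapes (for instance an escaped backslash directly before the `u`):

```lua
-- Hypothetical helper (not part of djot): expands \u{XXXX} escapes
-- with any number of hex digits. Assumes Lua 5.3+ for utf8.char.
local function expand_unicode_escapes(text)
  -- In a Lua pattern the backslash and braces are ordinary characters;
  -- '\\' in the string literal is a single backslash.
  return (text:gsub('\\u{(%x+)}', function(hex)
    local cp = tonumber(hex, 16)
    if cp and cp >= 0 and cp <= 0x10FFFF then
      return utf8.char(cp)
    end
    return nil -- out of range: keep the original text
  end))
end
```

With this, `\u{14b}` and `\u{014B}` would both produce the same character, which is the point of the braces: the digit count is not fixed.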
That said, I think :0x14b: and :331:, and hopefully {:0x14b:} and {:331:}, would be a reasonable syntax as an extension of the existing emoji syntax (which IMO should include {:emoji:}), since it might allow processors to support custom names: :entity:, {:Unicode name:} or whatever.
I do not care about the syntax here, but would like to point out that entities are essential for comfortable writing of mixed-language texts, e.g. when mixing right-to-left and left-to-right languages, as is common in the United Arab Emirates, Qatar, etc.
So any solution you come up with here has to be easy to read (and comfortable to write) for characters that change the direction etc.
> entities are essential for comfortable writing of mixed-language texts, e.g. when mixing right-to-left and left-to-right languages, as is common in the United Arab Emirates, Qatar, etc.
Can you explain a bit more why entities help with this? (E.g. give an example?)
Is the purpose for supporting entities to let you put in unicode characters when you're unable to insert the actual unicode character into your source? (That is, you know the character you want but cannot copy/paste it into your content file? Is it common to know the codepoint but not be able to copy/paste the character in?)
:U+2192: (for "→") is pretty syntax, and symmetric with emoji syntax, but not very readable (unless you happen to know that 2192 means "→"). Those HTML entities are more readable: &rarr; (and potentially easier to remember), though I agree with @marrus-sh about their problems.
I didn't realize that the list of djot-supported emojis was so large. Seems like adding 10 or 20 commonly-used readable unicode char names like :right-arrow: wouldn't be too crazy, would it?
@uvtc the bigger concern is invisible characters, for example variation selectors, right‐to‐left and left‐to‐right marks, ligation marks (zero‐width joiner and zero‐width non‐joiner), characters which allow breaks (zero‐width space) and prevent them (word joiner), “shy” hyphens, etc…… in some text editors it may be possible to inspect whether these characters are present (CotEditor for example is very good), but in others it may not, and regardless simply having those characters written out in the text is often much easier to handle.
as an example, the codepoint U+3402 㐂 has five different registered variations, which may be indicated by appending the variation selectors U+E0100..U+E0104. the font you are using when composing your document is not necessarily going to be the same one that you use when rendering it, so it may not support all of the different variants. it would be very useful to be able to write 㐂{:U+E0102:}
to explicitly indicate the third variant, because (depending on fonts etc) the composed form 㐂󠄂 might not look any different than the character without any variations applied.
similar arguments extend to things like wanting to type no‐{:U+2060:}break to add a word‐joiner to suppress line breaking, etc…
as for having to remember the unicode codepoints as opposed to the names, i think many people probably would prefer writing {:U+E0102:} rather than {:variation_selector-19:}; it’s much shorter and easier to skim over in a line of text. in any case, supporting the latter would require unicode character name lookups, which would make implementation a little bit more difficult.
@jgm sorry for the delay - yes, the intent is mostly what @marrus-sh wrote above. Namely to make visible all those characters (incl. future ones) which change or influence overall "style", "form", "layout", "paragraphing", etc.
See my idea in #112 of generalizing the syntax currently used for emojis.
The idea would be that :smile: is parsed as, say,
    special text="smile"
If you use emojis, you can use this syntax for them with a filter that inserts the emoji character proper to the alias. But you could just as easily use a different filter to associate whatever unicode string you like.
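A minimal sketch of that idea, working on plain strings rather than on djot's actual AST (whose node shape is not assumed here): the alias table, its entries, and the function name `resolve_symbols` are all hypothetical, and a different table would give a different "filter".

```lua
-- Hypothetical alias table; any mapping from alias to Unicode string works.
-- Assumes Lua 5.3+ for utf8.char.
local aliases = {
  ['right-arrow'] = utf8.char(0x2192), -- rightwards arrow
  ['wj'] = utf8.char(0x2060),          -- word joiner (invisible)
}

-- Replaces :alias: with mapping[alias]; unknown aliases are left as written.
local function resolve_symbols(text, mapping)
  -- The outer parentheses discard gsub's second result (the match count).
  return (text:gsub(':([%w_+-]+):', function(alias)
    return mapping[alias] -- nil keeps the original :alias: text
  end))
end
```

Swapping in an emoji table instead of `aliases` would give the usual emoji behaviour, which is the appeal of parsing the syntax generically and leaving the interpretation to a filter.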
I have written a simple Pandoc filter which replaces codepoint escapes like :0x14b:, :331: in strings with characters.
Gotcha: a literal colon (:) next to a digit must be escaped as :58: or :0x3a:!
    local char = utf8.char
    local pat = '(%:(%w+)%:)'

    local function subst (match, id)
      -- If we can numify it, it is probably a codepoint.
      local cp = tonumber(id)
      if cp then
        -- If the codepoint is out of range, char throws a scarcely helpful
        -- error, so catch it and raise a clearer one.
        local ok, res = pcall(char, cp)
        if ok then
          return res
        end
        error("Failed to convert " .. tostring(match) .. " to a character:\n\t" .. tostring(res))
      end
      -- Not a number: leave the text as it was.
      return match
    end

    function Str (str)
      -- The extra parentheses keep only the substituted string and drop
      -- gsub's second return value (the match count).
      return pandoc.Str((str.text:gsub(pat, subst)))
    end
It could easily be ported to a djot filter using my pure-Lua char function from https://github.com/jgm/djot/issues/44#issue-1330859441
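The colon gotcha mentioned above is easy to reproduce outside Pandoc. The following standalone sketch uses a hypothetical `expand` function that mirrors the substitution, except that it leaves unconvertible matches untouched instead of raising an error (a deliberate simplification for the demonstration):

```lua
-- Hypothetical standalone version of the substitution, for demonstration only.
-- Assumes Lua 5.3+ for utf8.char and math.type.
local function expand(text)
  return (text:gsub(':(%w+):', function(id)
    local cp = tonumber(id) -- accepts decimal ("331") and hex ("0x14b")
    if cp and math.type(cp) == 'integer' and cp >= 0 and cp <= 0x10FFFF then
      return utf8.char(cp)
    end
    return nil -- not a usable codepoint: keep the original text
  end))
end

-- A time like 12:30:45 contains ":30:", which is read as codepoint 30.
local mangled = expand('at 12:30:45')
-- Escaping each colon as :58: survives the substitution intact.
local escaped = expand('at 12:58:30:58:45')
```

Here `mangled` ends up with a control character (U+001E) between `12` and `45`, while `escaped` comes out as the intended `at 12:30:45`.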