
It is not clear how to write a multipoint entity in your entity list

Open StoneCypher opened this issue 2 years ago • 5 comments

Some HTML entities, such as nsubE, are represented by multiple Unicode code points (in this case U+2AC5 U+0338). This is particularly common for math symbols that use a combining slash to strike through another symbol.

It is not immediately clear to me how to represent that in the kramdown entity list.

If you could tell me how to represent that one case please, I would happily extend it to the remainder.

StoneCypher avatar Oct 30 '21 20:10 StoneCypher

The conversion from codepoint to character string is done like this:

[code_point].pack('U*')

This will create the correct string representation for any Unicode codepoint. So as long as the entity consists of a single code point, this will work.
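As a small illustration of the conversion described above, using decimal code points that appear in the entity table quoted in this thread (the variable names are only for the example):

```ruby
# Each array holds exactly one Unicode code point (decimal).
alpha = [913].pack('U*')  # GREEK CAPITAL LETTER ALPHA
ouml  = [214].pack('U*')  # LATIN CAPITAL LETTER O WITH DIAERESIS

puts alpha
puts ouml

# The 'U' directive yields a UTF-8 encoded string.
puts alpha.encoding
```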

Does that clear it up?

gettalong avatar Oct 31 '21 21:10 gettalong

I apologize. That isn't what I meant.

      ENTITY_TABLE = [
        [913, 'Alpha'],
        [914, 'Beta'],
        [915, 'Gamma'],

...

        [213, 'Otilde'],
        [214, 'Ouml'],
        [215, 'times'],

Please pretend for a moment that there was no dedicated capital-O umlaut Ö character. There is, of course; it's U+00D6, represented here as decimal 214. But let's pretend there wasn't.

In Unicode, there is a dedicated combining diaeresis, and you can attach it to other characters to construct the character you need. As such, you could make the character with capital O (U+004F) followed by the combining diaeresis ◌̈ (U+0308). We prefer the precomposed Ö because fonts trying to typeset symbols above letters typically do a bad job, sorting is a nightmare, and so on, but you can actually put an umlaut over whatever you like, including the poop emoji, if you really want to.
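To make the difference concrete, here is a minimal Ruby sketch comparing the precomposed Ö with the O-plus-combining-diaeresis sequence. The variable names are illustrative only; `String#unicode_normalize` is part of Ruby's standard String API.

```ruby
precomposed = [0x00D6].pack('U*')          # Ö as a single code point
combined    = [0x004F, 0x0308].pack('U*')  # O followed by combining diaeresis

# The two strings render alike but differ at the code point level.
puts precomposed.codepoints.length  # one code point
puts combined.codepoints.length     # two code points
puts precomposed == combined

# NFC normalization composes the pair into the precomposed character.
puts combined.unicode_normalize(:nfc) == precomposed
```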

So for a moment, please pretend that I want to rewrite your Ouml rule to emit two code points, and construct the Ö instead of using the real one. In this case it's silly, but this is legitimately how quite a few entities (particularly in math) are written. For example, ⫅̸ (not subset-equal) is written as U+2288, the dedicated math symbol, but really should be written as U+2AC5 (decimal 10949), subset-equal, followed by U+0338, the negating slash (the logic symbol), instead.

And that's hard to think about, so we're lying, and talking about O umlaut.

If for some stupid reason I wanted to emit U+004F U+0308 for Ouml in this table, how would I do it?

StoneCypher avatar Oct 31 '21 22:10 StoneCypher

I see. This is not possible with how the entities are currently implemented in kramdown, although producing the string itself is easy: [code_point1, code_point2].pack('U*').

As far as I can see, however, all the HTML5 entities are just single-codepoint entities? So this should not be a problem here.

Edit: Sorry, I just looked at the PR and not at the original issue - there you also listed entities with two codepoints. Supporting those entails revamping the entity implementation.
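For illustration only, one way a table could accommodate both single and multiple code points is sketched below. This is not kramdown's actual implementation: `SKETCH_ENTITY_TABLE`, `entity_string`, and `SKETCH_ENTITIES` are hypothetical names invented for this example.

```ruby
# Hypothetical table: the first element is either one code point
# or an array of code points; the second is the entity name.
SKETCH_ENTITY_TABLE = [
  [214, 'Ouml'],
  [[0x2AC5, 0x0338], 'nsubE']
].freeze

# Array() wraps a lone integer and leaves an array unchanged,
# so both entry shapes go through the same pack call.
def entity_string(code_points)
  Array(code_points).pack('U*')
end

# Build a name => string lookup from the table.
SKETCH_ENTITIES = SKETCH_ENTITY_TABLE.each_with_object({}) do |(code_points, name), map|
  map[name] = entity_string(code_points)
end

puts SKETCH_ENTITIES['Ouml']
puts SKETCH_ENTITIES['nsubE'].codepoints.inspect
```

The point of the sketch is that the packing step itself does not care how many code points it receives; the work in a real implementation lies in everything else that assumes one code point per entity.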

gettalong avatar Nov 01 '21 04:11 gettalong

There are a few.

| Name | Symbol | Code points |
|---|---|---|
| ncongdot | ⩭̸ | U+2A6D (10861), U+0338 (824) |
| nleqslant, nles, NotLessSlantEqual | ⩽̸ | U+2A7D (10877), U+0338 (824) |
| ngeqslant, nges, NotGreaterSlantEqual | ⩾̸ | U+2A7E (10878), U+0338 (824) |

There are 65 other than these three.

StoneCypher avatar Nov 01 '21 05:11 StoneCypher

> Edit: Sorry, I just looked at the PR and not at the original issue - there you also listed entities with two codepoints. Supporting those entails revamping the entity implementation.

❤️ ❤️ ❤️

Thank you

StoneCypher avatar Nov 01 '21 05:11 StoneCypher