snudown Allow all entities from HTML5 spec.

Hello,

This post in /r/RESissues reported that the list of allowed HTML entities was incomplete. To resolve this, I implemented code in setup.py to build the list of allowed entities dynamically from the JSON file released as part of the HTML5 specification.

Notes:

The JSON is included in the repository instead of being automatically downloaded from the W3C. I thought having setup.py download the JSON file automatically would be a nice touch, but would violate the principle of least astonishment.
Certain older entities are, for backwards compatibility reasons, recognized even without semicolons. These are not supported.
The produced gperf input is piped directly into gperf and is not stored in a file.
I fixed the setup.py script to respond to changes to the html_entities.h by rebuilding the extension, as expected. In addition, the header file will only be rebuilt if any of the files used to generate it change.
I admit that this solution is more complicated than simply updating the list in html_entities.gperf. However, it should have the advantage of less work for future updates (e.g. HTML5.1) — simply update html_entities.json with the copy from the newer HTML specification.

Thank you for your consideration of this pull request.

/u/nandhp

Sep 19 '15 04:09 nandhp

Hey /u/nandhp, thanks for the PR!

The gperf changes look good, but I'm going to have to think about this one a bit. Up until now we've tried to restrict snudown's output to valid XHTML1 for a few reasons.

One is that we use snudown's output verbatim in RSS feeds, and even though XHTML1's named entities aren't valid in either RSS or Atom, they're well-supported by most feed readers.

The other is that we currently support browsers and parsers that don't understand that &checkmark; == ✓. Decoding the character entities is a nightmare for mobile apps, but for the most part they understand XHTML1's named entities, as well as numeric entities like ✓.

Sep 19 '15 05:09 JordanMilne

Here's a thought.

Reddit must perform transformations (Snudown -> min(X{,HT}ML)) on the data users submit in order to present it for display. What if "&checkmark" -> "✓" was added to the standard set of transformations? Users who know HTML entities can use them, and the likelihood of parsers choking is lowered.

Sep 19 '15 07:09 i336

@i336

Yep, doing that in gperf would make sense. Right now we're just using gperf to check for membership in a set, but I believe it could be changed to do an entity name -> decimal representation lookup.

Sep 19 '15 08:09 JordanMilne

Converting non-XHTML entities to numerics was actually pretty easy to implement, since the JSON from W3 includes the codepoints for each entity.

I'm not quite sure why there's two output paths -- rndr->cb.entity and bufput -- so I just copied from below. It looks like the rndr->cb.entity path isn't tested, though, so I'm not sure I got it right.

When/if you are ready to accept this pull request, let me know and I'll squash these commits together so that the html_entities.gprof file isn't rewritten twice.

Sep 19 '15 14:09 nandhp

Ahhhh, I see what I was missing now. XHTML 1 entities will be output verbatim as before, but HTML5 entities will be output hex-encoded.

Sep 20 '15 04:09 JordanMilne

Overall, this looks pretty good! My only other thought is that the function's flow is a bit hard to read.

Right now it's something like:

if is_numeric:
    if not is_valid_numeric_entity(entity):
        return
else:
    resolved_entity = resolve_character_entity(entity)
    if not resolved_entity:
        return
    # This is entity is not supposed to be printed verbatim
    if resolved_entity.codepoints[0]:
        print the codepoints as entities
        return end

handle printing of numeric entities and entities that may be printed verbatim
return end

where the in between the validation and output steps for numeric entities and named entities that may be output verbatim, we have the verification and output for named entities that must be output as numeric entities.

IMO it'd be better to have a clean split between the verification and output steps, like

resolved_entity = None
output_raw_entity = False

if is_numeric:
    if not is_valid_numeric_entity(entity):
        return
    output_raw_entity = True
else:
    resolved_entity = resolve_character_entity(entity)
    if not resolved_entity:
        return
    output_raw_entity = (resolve_entity.codepoints[0] == 0)

if output_raw_entity:
    print the entity as-is
else:
    assert(resolved_entity)
    print the codepoints as numeric entities
return end

Sep 20 '15 08:09 JordanMilne

:eyeglasses: @JordanMilne @spladug (Feel free to duck out! Just thought this could use some extra C-understanding eyes.)

Sep 20 '15 08:09 JordanMilne

Hi,

Thank you for taking such a detailed look at the code!

Yes, the logic is a little convoluted. I suppose I did it that way to avoid excessive changes to the existing codepaths, as well as to minimize changes the entities themselves. Since numeric entities have to be partially rewritten anyway (to normalize &#X to &#x), I suppose it would be fine to completely rewrite them, which would allow them to use the same output path. I guess the resulting code would look something like this:

resolved_entity = None
output_entities = [0, 0]

if is_numeric:
    entity_value = strtol(input_entity, entity_base)
    if not is_valid_numeric_entity(entity_value):
        return
    output_entities = [entity_value, 0]
else:
    resolved_entity = resolve_character_entity(input_entity)
    if not resolved_entity:
        return
    if resolved_entity.codepoints[0] == 0:
        output_entities = [-1, 0] # Dummy numeric for raw entity output
    else:
        output_entities = resolved_entity.codepoints

for entity_value in output_entities where entity_value != 0:
    char rendered_entity[MAX_NUM_ENTITY_LEN + 5]
    char *buf
    if entity_value == -1:
        # Output raw entity
        buf = input_entity
        buf_len = end
    else:
        # Generate numeric entity
        buf_len = sprintf(rendered_entity, "&#x%x;", entity_value)
        buf = rendered_entity

    if callback:
        callback(buf)
    else:
        bufput(buf, buf_len)
return end

Does that seem good? In this pseudocode, I use an entity value of -1 to mean "actually, just output the raw entity", which seems like it might be considered slightly unclear or some kind of abuse of data type.

Sep 20 '15 13:09 nandhp

I suppose I did it that way to avoid excessive changes to the existing codepaths, as well as to minimize changes the entities themselves

IMO not changing the existing entities was the right approach. Some consumers of Snudown's rendered HTML do iffy things, like use hand-rolled HTML parsers and manually decode HTML entities like paragraph.replace("&", "&").replace("<", "<")... We wouldn't want to break them all of a sudden for stuff that previously worked, but if they can't handle the new stuff it's their fault for not conforming to the spec :smile:.

Since numeric entities have to be partially rewritten anyway (to normalize &#X to &#x), I suppose it would be fine to completely rewrite them, which would allow them to use the same output path. I guess the resulting code would look something like this [...] Does that seem good? In this pseudocode, I use an entity value of -1 to mean "actually, just output the raw entity", which seems like it might be considered slightly unclear or some kind of abuse of data type.

Ehhh, that still feels a little confusing to me. The use of a magic value is a good indicator that the raw entity stuff doesn't really fit in the loop. Ultimately we have 3 types of entities we want to render: numeric entities, named entities which may be output verbatim, and named entities which must have their endpoints output as numeric entities. Making numeric entities share a code path with the named entity output stuff is a little confusing as well.

I forgot to mention this, but the rndr->cb.entity stuff can stay as it is in master. That's just a hook for people extending snudown to render entity's we've ok'd in some other way. Maybe they want to render &checkmark; as a unicode snowman :snowman:.

So I'm thinking something like:

    if (rndr->cb.entity) {
        work.data = data;
        work.size = end;
        rndr->cb.entity(ob, &work, rndr->opaque);
    } else {
        // maybe instead of looking at entities[0] we could add an
        // `.output_encoded` member to the struct? The gperf table's
        // pretty small, so I doubt it'd cause any performance issues.
        if ( resolved_entity && resolved_entity.entities[0] ) {
            // ... loop over / output entities
        }
        /* Necessary so we can normalize `&#X3E;` to `&#x3E;` */
        bufputc(ob, '&');
        if (numeric)
            bufputc(ob, '#');
        if (hex)
            bufputc(ob, 'x');
        bufput(ob, data + content_start, end - content_start);
    }

IMO that'd be a bit clearer, and we keep the benefit of only having to do a single string write for numeric entities and entities which may be output verbatim.

Sep 20 '15 15:09 JordanMilne

By passing the input directly to rndr->cb.entity and adding the extra variable to the struct (which is actually a counter), the code suddenly becomes very simple. The diff (for the C code) now comes down to this, which I think is now incredibly clear and makes even fewer adjustments to the codepath:

diff --git a/src/markdown.c b/src/markdown.c
index abe4a1d..ecf46a5 100644
--- a/src/markdown.c
+++ b/src/markdown.c
@@ -721,6 +721,7 @@ char_entity(struct buf *ob, struct sd_markdown *rndr, uint8_t *data, size_t max_
        int hex = 0;
        int entity_base;
        uint32_t entity_val;
+       const struct html_entity *named_entity = NULL;

        if (end < size && data[end] == '#') {
                numeric = 1;
@@ -769,7 +770,9 @@ char_entity(struct buf *ob, struct sd_markdown *rndr, uint8_t *data, size_t max_
                if (!is_valid_numeric_entity(entity_val))
                        return 0;
        } else {
-               if (!is_allowed_named_entity((const char *)data, end))
+               /* Lookup the entity in the named entity table. */
+               named_entity = is_allowed_named_entity((const char *)data, end);
+               if (!named_entity)
                        return 0;
        }

@@ -777,6 +780,17 @@ char_entity(struct buf *ob, struct sd_markdown *rndr, uint8_t *data, size_t max_
                work.data = data;
                work.size = end;
                rndr->cb.entity(ob, &work, rndr->opaque);
+       } else if (named_entity && named_entity->output_numeric) {
+               /* Output the named entity as numeric entities. */
+               int i;
+               assert(named_entity->output_numeric <= MAX_ENTITY_CODEPOINTS);
+               for (i = 0; i < named_entity->output_numeric; i++) {
+                       /* Verify the codepoint is a valid entity. */
+                       entity_val = named_entity->codepoints[i];
+                       assert(is_valid_numeric_entity(entity_val));
+                       /* Render codepoint to an entity. */
+                       bufprintf(ob, "&#x%X;", entity_val);
+               }
        } else {
                /* Necessary so we can normalize `&#X3E;` to `&#x3E;` */
                bufputc(ob, '&');

Sep 21 '15 01:09 nandhp

Yep, that seems sane to me. Only thing is I'd change is_allowed_named_entity to something like resolve_named_entity now its purpose has changed

Sep 21 '15 04:09 JordanMilne

I'm going to fuzz this and have someone else take a look to make sure that I didn't miss anything, but other than a couple nits it looks good to me! Thanks!

Sep 21 '15 04:09 JordanMilne

I renamed the lookup function. Let me know if you have any other feedback. I would also be happy to squish the commits down to just one or two, if for no other reason than the html_entities.gperf file had its list of entities gratuitously deleted and reinserted (and it seems unnecessary to keep both of those diffs).

Sep 21 '15 20:09 nandhp

snudown snudown copied to clipboard

Allow all entities from HTML5 spec.

snudown
snudown copied to clipboard