snudown
snudown copied to clipboard
Allow all entities from HTML5 spec.
Hello,
This post in /r/RESissues reported that the list of allowed HTML entities was incomplete. To resolve this, I implemented code in setup.py
to build the list of allowed entities dynamically from the JSON file released as part of the HTML5 specification.
Notes:
- The JSON is included in the repository instead of being automatically downloaded from the W3C. I thought having
setup.py
download the JSON file automatically would be a nice touch, but would violate the principle of least astonishment. - Certain older entities are, for backwards compatibility reasons, recognized even without semicolons. These are not supported.
- The produced gperf input is piped directly into
gperf
and is not stored in a file. - I fixed the
setup.py
script to respond to changes to thehtml_entities.h
by rebuilding the extension, as expected. In addition, the header file will only be rebuilt if any of the files used to generate it change. - I admit that this solution is more complicated than simply updating the list in
html_entities.gperf
. However, it should have the advantage of less work for future updates (e.g. HTML5.1) — simply updatehtml_entities.json
with the copy from the newer HTML specification.
Thank you for your consideration of this pull request.
Hey /u/nandhp, thanks for the PR!
The gperf changes look good, but I'm going to have to think about this one a bit. Up until now we've tried to restrict snudown's output to valid XHTML1 for a few reasons.
One is that we use snudown's output verbatim in RSS feeds, and even though XHTML1's named entities aren't valid in either RSS or Atom, they're well-supported by most feed readers.
The other is that we currently support browsers and parsers that don't understand that ✓
== ✓
. Decoding the character entities is a nightmare for mobile apps, but for the most part they understand XHTML1's named entities, as well as numeric entities like ✓
.
Here's a thought.
Reddit must perform transformations (Snudown -> min(X{,HT}ML)) on the data users submit in order to present it for display. What if "&checkmark
" -> "✓
" was added to the standard set of transformations? Users who know HTML entities can use them, and the likelihood of parsers choking is lowered.
@i336
Yep, doing that in gperf would make sense. Right now we're just using gperf to check for membership in a set, but I believe it could be changed to do an entity name -> decimal representation
lookup.
Converting non-XHTML entities to numerics was actually pretty easy to implement, since the JSON from W3 includes the codepoints for each entity.
I'm not quite sure why there's two output paths -- rndr->cb.entity
and bufput
-- so I just copied from below. It looks like the rndr->cb.entity
path isn't tested, though, so I'm not sure I got it right.
When/if you are ready to accept this pull request, let me know and I'll squash these commits together so that the html_entities.gprof
file isn't rewritten twice.
Ahhhh, I see what I was missing now. XHTML 1 entities will be output verbatim as before, but HTML5 entities will be output hex-encoded.
Overall, this looks pretty good! My only other thought is that the function's flow is a bit hard to read.
Right now it's something like:
if is_numeric:
if not is_valid_numeric_entity(entity):
return
else:
resolved_entity = resolve_character_entity(entity)
if not resolved_entity:
return
# This is entity is not supposed to be printed verbatim
if resolved_entity.codepoints[0]:
print the codepoints as entities
return end
handle printing of numeric entities and entities that may be printed verbatim
return end
where the in between the validation and output steps for numeric entities and named entities that may be output verbatim, we have the verification and output for named entities that must be output as numeric entities.
IMO it'd be better to have a clean split between the verification and output steps, like
resolved_entity = None
output_raw_entity = False
if is_numeric:
if not is_valid_numeric_entity(entity):
return
output_raw_entity = True
else:
resolved_entity = resolve_character_entity(entity)
if not resolved_entity:
return
output_raw_entity = (resolve_entity.codepoints[0] == 0)
if output_raw_entity:
print the entity as-is
else:
assert(resolved_entity)
print the codepoints as numeric entities
return end
:eyeglasses: @JordanMilne @spladug (Feel free to duck out! Just thought this could use some extra C-understanding eyes.)
Hi,
Thank you for taking such a detailed look at the code!
Yes, the logic is a little convoluted. I suppose I did it that way to avoid excessive changes to the existing codepaths, as well as to minimize changes the entities themselves. Since numeric entities have to be partially rewritten anyway (to normalize &#X
to &#x
), I suppose it would be fine to completely rewrite them, which would allow them to use the same output path. I guess the resulting code would look something like this:
resolved_entity = None
output_entities = [0, 0]
if is_numeric:
entity_value = strtol(input_entity, entity_base)
if not is_valid_numeric_entity(entity_value):
return
output_entities = [entity_value, 0]
else:
resolved_entity = resolve_character_entity(input_entity)
if not resolved_entity:
return
if resolved_entity.codepoints[0] == 0:
output_entities = [-1, 0] # Dummy numeric for raw entity output
else:
output_entities = resolved_entity.codepoints
for entity_value in output_entities where entity_value != 0:
char rendered_entity[MAX_NUM_ENTITY_LEN + 5]
char *buf
if entity_value == -1:
# Output raw entity
buf = input_entity
buf_len = end
else:
# Generate numeric entity
buf_len = sprintf(rendered_entity, "&#x%x;", entity_value)
buf = rendered_entity
if callback:
callback(buf)
else:
bufput(buf, buf_len)
return end
Does that seem good? In this pseudocode, I use an entity value of -1
to mean "actually, just output the raw entity", which seems like it might be considered slightly unclear or some kind of abuse of data type.
I suppose I did it that way to avoid excessive changes to the existing codepaths, as well as to minimize changes the entities themselves
IMO not changing the existing entities was the right approach. Some consumers of Snudown's rendered HTML do iffy things, like use hand-rolled HTML parsers and manually decode HTML entities like paragraph.replace("&", "&").replace("<", "<")...
We wouldn't want to break them all of a sudden for stuff that previously worked, but if they can't handle the new stuff it's their fault for not conforming to the spec :smile:.
Since numeric entities have to be partially rewritten anyway (to normalize &#X to &#x), I suppose it would be fine to completely rewrite them, which would allow them to use the same output path. I guess the resulting code would look something like this [...] Does that seem good? In this pseudocode, I use an entity value of -1 to mean "actually, just output the raw entity", which seems like it might be considered slightly unclear or some kind of abuse of data type.
Ehhh, that still feels a little confusing to me. The use of a magic value is a good indicator that the raw entity stuff doesn't really fit in the loop. Ultimately we have 3 types of entities we want to render: numeric entities, named entities which may be output verbatim, and named entities which must have their endpoints output as numeric entities. Making numeric entities share a code path with the named entity output stuff is a little confusing as well.
I forgot to mention this, but the rndr->cb.entity
stuff can stay as it is in master
. That's just a hook for people extending snudown to render entity's we've ok'd in some other way. Maybe they want to render ✓
as a unicode snowman :snowman:.
So I'm thinking something like:
if (rndr->cb.entity) {
work.data = data;
work.size = end;
rndr->cb.entity(ob, &work, rndr->opaque);
} else {
// maybe instead of looking at entities[0] we could add an
// `.output_encoded` member to the struct? The gperf table's
// pretty small, so I doubt it'd cause any performance issues.
if ( resolved_entity && resolved_entity.entities[0] ) {
// ... loop over / output entities
}
/* Necessary so we can normalize `>` to `>` */
bufputc(ob, '&');
if (numeric)
bufputc(ob, '#');
if (hex)
bufputc(ob, 'x');
bufput(ob, data + content_start, end - content_start);
}
IMO that'd be a bit clearer, and we keep the benefit of only having to do a single string write for numeric entities and entities which may be output verbatim.
By passing the input directly to rndr->cb.entity
and adding the extra variable to the struct (which is actually a counter), the code suddenly becomes very simple. The diff (for the C code) now comes down to this, which I think is now incredibly clear and makes even fewer adjustments to the codepath:
diff --git a/src/markdown.c b/src/markdown.c
index abe4a1d..ecf46a5 100644
--- a/src/markdown.c
+++ b/src/markdown.c
@@ -721,6 +721,7 @@ char_entity(struct buf *ob, struct sd_markdown *rndr, uint8_t *data, size_t max_
int hex = 0;
int entity_base;
uint32_t entity_val;
+ const struct html_entity *named_entity = NULL;
if (end < size && data[end] == '#') {
numeric = 1;
@@ -769,7 +770,9 @@ char_entity(struct buf *ob, struct sd_markdown *rndr, uint8_t *data, size_t max_
if (!is_valid_numeric_entity(entity_val))
return 0;
} else {
- if (!is_allowed_named_entity((const char *)data, end))
+ /* Lookup the entity in the named entity table. */
+ named_entity = is_allowed_named_entity((const char *)data, end);
+ if (!named_entity)
return 0;
}
@@ -777,6 +780,17 @@ char_entity(struct buf *ob, struct sd_markdown *rndr, uint8_t *data, size_t max_
work.data = data;
work.size = end;
rndr->cb.entity(ob, &work, rndr->opaque);
+ } else if (named_entity && named_entity->output_numeric) {
+ /* Output the named entity as numeric entities. */
+ int i;
+ assert(named_entity->output_numeric <= MAX_ENTITY_CODEPOINTS);
+ for (i = 0; i < named_entity->output_numeric; i++) {
+ /* Verify the codepoint is a valid entity. */
+ entity_val = named_entity->codepoints[i];
+ assert(is_valid_numeric_entity(entity_val));
+ /* Render codepoint to an entity. */
+ bufprintf(ob, "&#x%X;", entity_val);
+ }
} else {
/* Necessary so we can normalize `>` to `>` */
bufputc(ob, '&');
Yep, that seems sane to me. Only thing is I'd change is_allowed_named_entity
to something like resolve_named_entity
now its purpose has changed
I'm going to fuzz this and have someone else take a look to make sure that I didn't miss anything, but other than a couple nits it looks good to me! Thanks!
I renamed the lookup function. Let me know if you have any other feedback. I would also be happy to squish the commits down to just one or two, if for no other reason than the html_entities.gperf
file had its list of entities gratuitously deleted and reinserted (and it seems unnecessary to keep both of those diffs).