libasciidoc Parse the HTML entities and inline them

(This might be controversial).

Given that we have a full Unicode (UTF-8) backend, it would be good if we could parse HTML entities (both numeric ones and the named ones commonly used in HTML) and convert them to their UTF-8 equivalents. These could then be passed around through the context as UTF-8 text.

One big benefit is that backends that don't deal with HTML or SGML (think of future ones like PDF or PostScript), or UTF-8 clean backends (ePub), would get the real Unicode characters.
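As a rough illustration of the conversion described above, Go's standard `html` package can already decode both named and numeric entities into UTF-8 text. This is only a sketch: `decodeEntities` is a hypothetical helper name, not an existing libasciidoc function, and where such a call would live in the parser is left open.

```go
package main

import (
	"fmt"
	"html"
)

// decodeEntities is a hypothetical helper (not part of libasciidoc).
// It replaces named and numeric HTML entities with the characters they
// denote, returning plain UTF-8 text.
func decodeEntities(s string) string {
	return html.UnescapeString(s)
}

func main() {
	// Named (&copy;, &gt;), decimal (&#233;) and hexadecimal (&#xA9;) forms.
	fmt.Println(decodeEntities("&copy; 2020, caf&#233; &gt; tea, &#xA9; again"))
}
```

Text without entities passes through unchanged, so a backend that never produces HTML would simply see ordinary Unicode characters.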

Jun 30 '20 21:06 gdamore

Likeable idea. If it is controversial, would a feature flag make sense?

Jul 09 '20 18:07 pjanx

This request is more or less related to the support for custom substitutions. One of them is "special characters", and it's supposed to convert characters such as <, > and & to their HTML equivalents. However, I don't believe that this substitution should happen during the parsing, when the "draft document" is processed, because I doubt that other output formats such as PDF will want to deal with such HTML entities. So, I'm thinking about having a substitution phase during the rendering. Or maybe it's just a matter of using html/template vs text/template to enable or disable this particular substitution 🤔
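To make the html/template vs text/template idea concrete, here is a minimal sketch that runs the same template source through both packages. The template string and the function names `renderWithHTMLTemplate` and `renderWithTextTemplate` are made up for illustration and are not taken from libasciidoc.

```go
package main

import (
	"bytes"
	"fmt"
	htmltemplate "html/template"
	texttemplate "text/template"
)

// paragraphTmpl is an illustrative template, not one used by libasciidoc.
const paragraphTmpl = `<p>{{ . }}</p>`

// renderWithHTMLTemplate escapes special characters in the content.
func renderWithHTMLTemplate(content string) (string, error) {
	t, err := htmltemplate.New("p").Parse(paragraphTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, content); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// renderWithTextTemplate writes the content verbatim.
func renderWithTextTemplate(content string) (string, error) {
	t, err := texttemplate.New("p").Parse(paragraphTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, content); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	escaped, _ := renderWithHTMLTemplate("a < b & c")
	verbatim, _ := renderWithTextTemplate("a < b & c")
	fmt.Println(escaped)  // special characters become entities
	fmt.Println(verbatim) // special characters are left as typed
}
```

With this split, the choice of template package alone decides whether the "special characters" substitution is applied at render time.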

Jul 14 '20 19:07 xcoulon

So, I really think the parsed document structure should contain literal values -- not entities. For example, we can use " for a quote. The fact that " needs to be converted to an entity should be done in the renderer.

Likewise for other characters -- e.g. (C) should be converted to the copyright symbol unicode value -- and we can turn that into a numeric entity in the renderer (or possibly leave it alone, if it isn't a problem.)

Where it gets tricky is that we also allow end users to specify an HTML entity in the markup (that is, HTML entities are documented as part of the supported ASCIIDOC syntax).

Those entities should be parsed, and converted into their character equivalents in the draft document. They might then afterwards be converted back into entities, or not, depending on the renderer.
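A small sketch of that round trip, assuming the parser side can be approximated with `html.UnescapeString` and an HTML renderer with `html.EscapeString`. The names `parseInline`, `toHTML` and `toPlain` are hypothetical stand-ins, not libasciidoc functions.

```go
package main

import (
	"fmt"
	"html"
)

// parseInline stands in for the parser stage: entities typed by the
// author become literal characters in the draft document.
func parseInline(source string) string {
	return html.UnescapeString(source)
}

// toHTML stands in for an HTML renderer: it re-escapes only the
// characters that HTML requires (<, >, &, ' and ").
func toHTML(text string) string {
	return html.EscapeString(text)
}

// toPlain stands in for a non-HTML backend that wants the raw text.
func toPlain(text string) string {
	return text
}

func main() {
	draft := parseInline("&copy; 2020 Tom &amp; Jerry &lt;tj@example.com&gt;")
	fmt.Println(toHTML(draft))
	fmt.Println(toPlain(draft))
}
```

Note that in this sketch the copyright sign stays a literal character in the HTML output, because `html.EscapeString` only touches the five characters listed above. Turning it back into a numeric entity would be an extra, renderer-specific step.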

Jul 14 '20 19:07 gdamore

And yes, I think we should consider using html/template in the backend. I can start looking more closely at that if you agree.

Jul 14 '20 19:07 gdamore

So, I really think the parsed document structure should contain literal values -- not entities. For example, we can use " for a quote. The fact that " needs to be converted to an entity should be done in the renderer.

yes, that's my thinking as well

Likewise for other characters -- e.g. (C) should be converted to the copyright symbol unicode value -- and we can turn that into a numeric entity in the renderer (or possibly leave it alone, if it isn't a problem.)

yes, it's already done during the rendering

Where it gets tricky is that we also allow end users to specify an HTML entity in the markup (that is, HTML entities are documented as part of the supported ASCIIDOC syntax).

Those entities should be parsed, and converted into their character equivalents in the draft document. They might then afterwards be converted back into entities, or not, depending on the renderer.

For now, I believe that the html.Template already does a decent job with converting characters such as >, etc. to HTML entities

But reading the issue description again, I'm actually starting to think that replacing the HTML entities with their UTF-8 equivalent (eg: \u003E instead of &gt; for the > character) could be a solution that works for all backends 🤔

Jul 14 '20 19:07 xcoulon

And yes, I think we should consider using html/template in the backend. I can start looking more closely at that if you agree.

We used to use them to process HTML entities in some of the template fields in the past, but I believe that this was changed with the recent sgml/html5/xhtml5 refactoring...

Jul 14 '20 19:07 xcoulon

well, anyways, given that we have the parser on the one hand and multiple renderers on the other hand, it's likely that we can't (and should not) perform the "special characters" substitution (>, etc.) at the parser level, but rather, keep that for the renderers which need it (HTML5 and XHTML5 but probably not PDF in the future).

As far as the "replacements" substitution is concerned (eg: copyright (C)), we already convert such characters into Unicode during the parsing: https://github.com/bytesparadise/libasciidoc/blob/master/pkg/parser/parser.peg#L2102-L2118
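For readers who don't want to open the grammar, the effect of the "replacements" substitution can be approximated outside the parser as below. This is not how the linked PEG rules are written. It is a standalone sketch with a hypothetical `applyReplacements` helper, covering only three of the textual replacements.

```go
package main

import (
	"fmt"
	"strings"
)

// replacements maps a few textual shorthands to their Unicode characters.
// The list is deliberately partial and only illustrative.
var replacements = strings.NewReplacer(
	"(C)", "\u00a9", // copyright sign
	"(R)", "\u00ae", // registered sign
	"(TM)", "\u2122", // trade mark sign
)

// applyReplacements is a hypothetical helper, not a libasciidoc function.
func applyReplacements(s string) string {
	return replacements.Replace(s)
}

func main() {
	fmt.Println(applyReplacements("Copyright (C) 2020 Acme(TM)"))
}
```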

I guess we can keep working like that for now. In other words, having the "special characters" substitution executed during rendering, while the others happen earlier.

Jul 14 '20 21:07 xcoulon

I hope what you mean is that we convert things like '>' to entities in the rendering stage, and leave them unmolested during parsing.

I do think we should parse html entities in the parser stage as well. I don't think I busted that, but I haven't tested.

Jul 14 '20 23:07 gdamore

I hope what you mean is that we convert things like '>' to entities in the rendering stage, and leave them unmolested during parsing.

yes, that's what I meant, indeed

I do think we should parse html entities in the parser stage as well. I don't think I busted that, but I haven't tested.

My apologies, I stand corrected. It's now the sanitized type which aliases the html.HTML type and takes care of substituting the special characters (>, <, &).
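A sketch of how such a type could behave, under the assumption that it is a true Go type alias of `template.HTML` from html/template. The names `sanitized`, `sanitize` and `renderSpan` below are illustrative and may not match the actual libasciidoc code.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// sanitized is assumed here to be an alias of template.HTML, which
// html/template treats as already-safe content and does not escape again.
type sanitized = template.HTML

// sanitize escapes the special characters once, then marks the result
// as safe so the template does not escape it a second time.
func sanitize(s string) sanitized {
	return sanitized(template.HTMLEscapeString(s))
}

// renderSpan executes an illustrative template with the given value.
func renderSpan(v interface{}) (string, error) {
	t, err := template.New("span").Parse(`<span>{{ . }}</span>`)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, v); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, _ := renderSpan(sanitize("a > b & c"))
	fmt.Println(out) // escaped exactly once

	raw, _ := renderSpan(sanitized("<em>kept</em>"))
	fmt.Println(raw) // markup passes through untouched
}
```

The second call shows the flip side of this design: anything wrapped in the type without being escaped first reaches the output verbatim, so the escaping step has to happen before the conversion.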

I wonder why we should parse these entities in the parser stage as well? What would be the benefit? (I can see more time spent in the parser)

Jul 15 '20 06:07 xcoulon

@gdamore I believe that #734 fixes this issue.

Jul 19 '20 08:07 xcoulon

I think it probably does.

Aug 02 '20 06:08 gdamore