HTML: Use HTML intermediate language before final `Doc` rendering
Rationale
As @JacquesCarette mentioned in this comment, the HTML backend currently "tags" references by wrapping the referenced content in a new <div> tag. In most cases, the wrapped DOM tree does not have an id attribute, which makes the <div> redundant: the id attribute could be folded into the wrapped DOM tree.
For example:
<div id="abc">
<ul>
...
</ul>
</div>
is currently generated, but we could instead do:
<ul id="abc">
...
</ul>
because the <ul> had no id.
Complications
Currently, the HTML backend takes a LayoutObj and immediately serializes it to text, effectively "preventing" any additional modification to that element. In most cases this produces (nearly) standards-compliant output, but the output is sub-optimal and bulkier than needed.
Proposal
Structure
A solution which would be adequate for the HTML backend is to add a simple step between LayoutObj and Doc. It may look something like:
data DomElem = Elem String (Maybe String) [String] [Attrs] [DomElem] | Content String
Where
String is the tag. (ex. div)
Maybe String is the id. (ex. assump9)
[String] is a list of class names. (ex. ["code", "bases"])
[Attrs] is, loosely speaking, a list of key-value pairs for other element specific attributes. (ex. href="#top" for <a>)
[DomElem] is a list of wrapped DOM elements. More information on its purpose is below, but it allows functions to wrap sub-trees as needed.
And
Content is a means to output text verbatim.
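As a rough illustration of the structure above, here is a minimal sketch of the type and one value built with it. The proposal does not define `Attrs`, so a key-value pair is assumed; `exampleList` and its contents are hypothetical.

```haskell
-- Assumption: the proposal leaves Attrs unspecified; a (key, value)
-- pair is used here purely for illustration.
type Attrs = (String, String)

data DomElem
  = Elem String (Maybe String) [String] [Attrs] [DomElem]
    -- tag, optional id, class names, other attributes, children
  | Content String
    -- text emitted verbatim
  deriving (Eq, Show)

-- Hypothetical example: a <ul> that has classes but no id yet,
-- so an id could later be folded directly into it.
exampleList :: DomElem
exampleList =
  Elem "ul" Nothing ["code", "bases"] []
    [ Elem "li" Nothing [] [] [Content "first item"]
    , Elem "a" Nothing [] [("href", "#top")] [Content "back to top"]
    ]
```

Keeping the id as its own `Maybe String` field, rather than as one more attribute, is what lets later functions cheaply ask whether an element already has one.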
Functions
The goal of adding this intermediary step is to enhance convenience without much code clutter while producing better output. For convenience, I will use existing function names (where possible) from the HTML backend to bridge what the backend currently does and how that would transition to use this added structure.
- `wrap :: String -> [String] -> DomElem -> DomElem`
  Small type signature change from what it currently is. Currently the last two arguments are `Doc`. This would do the same thing it currently does: create a new tag with `[String]` classes and a `DomElem` body.
- A lot of places in the backend do not apply classes to the tags, leaving `[]` all over. It would be nice to add:
  `wrap' :: String -> DomElem -> DomElem`
  which does the same as `wrap` but removes the need to specify an empty list.
- `refWrap :: String -> DomElem -> DomElem`
  `refWrap` takes an `id` and tries to apply it to a `DomElem`. If the `DomElem` has an `id`, `refWrap` would wrap the `DomElem` in a `<div>` and apply the `id` to that; otherwise it applies the `id` directly to the `DomElem`.
- `addClasses :: [String] -> DomElem -> DomElem`
  Adds class names to a `DomElem`. Would allow for simplification of `makeList`.
- `addAttrs :: Attrs -> DomElem -> DomElem`
  Adds tag-specific attributes to a `DomElem`.
- `toText :: DomElem -> Doc`
  Serializes a `DomElem` to a document. Can be a centralized place to properly omit end tags for void elements.
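The functions listed above could be sketched roughly as follows. This is only one possible reading of the proposal, not the backend's actual code: `Attrs` is assumed to be a key-value pair, `toText` returns a plain `String` as a stand-in for `Doc` so the example stays self-contained, `voidTags` is a hypothetical helper, and no escaping of attribute values is attempted.

```haskell
type Attrs = (String, String)  -- assumption: not defined in the proposal

data DomElem
  = Elem String (Maybe String) [String] [Attrs] [DomElem]
  | Content String
  deriving (Eq, Show)

-- New tag with the given classes and a single child.
wrap :: String -> [String] -> DomElem -> DomElem
wrap tag classes body = Elem tag Nothing classes [] [body]

-- Same as wrap, without the empty class list.
wrap' :: String -> DomElem -> DomElem
wrap' tag = wrap tag []

-- Fold the id into the element when it has none; otherwise add a <div>.
refWrap :: String -> DomElem -> DomElem
refWrap ident (Elem tag Nothing cs attrs kids) = Elem tag (Just ident) cs attrs kids
refWrap ident other = Elem "div" (Just ident) [] [] [other]

addClasses :: [String] -> DomElem -> DomElem
addClasses new (Elem tag i cs attrs kids) = Elem tag i (cs ++ new) attrs kids
addClasses _ c@(Content _) = c

addAttrs :: Attrs -> DomElem -> DomElem
addAttrs a (Elem tag i cs attrs kids) = Elem tag i cs (attrs ++ [a]) kids
addAttrs _ c@(Content _) = c

-- Hypothetical helper: HTML void elements, which take no end tag.
voidTags :: [String]
voidTags =
  [ "area", "base", "br", "col", "embed", "hr", "img"
  , "input", "link", "meta", "source", "track", "wbr" ]

-- Stand-in for `DomElem -> Doc`: serializes to a String instead.
toText :: DomElem -> String
toText (Content s) = s
toText (Elem tag ident classes attrs kids)
  | tag `elem` voidTags = open
  | otherwise = open ++ concatMap toText kids ++ "</" ++ tag ++ ">"
  where
    open = "<" ++ unwords (tag : rendered) ++ ">"
    rendered = idAttr ++ classAttr ++ map pair attrs
    idAttr = maybe [] (\i -> [pair ("id", i)]) ident
    classAttr = [pair ("class", unwords classes) | not (null classes)]
    pair (k, v) = k ++ "=\"" ++ v ++ "\""
```

With these definitions, `toText (refWrap "abc" (wrap' "ul" (Content "...")))` yields `<ul id="abc">...</ul>`, while a second `refWrap` on the same element falls back to an enclosing `<div>`.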
Alternatives
- Do nothing. Leave the HTML backend alone and continue to produce sub-optimal and slightly standards non-compliant output.
- Instead of `DomElem`, use currying to partially apply functions and defer string output until more is known. (Not completely sure if this would work in all cases, but was a thought I had when considering alternatives.)
- Modify `refWrap` to do string searching to see if an `id` attribute is present on the top level tag. (Hacky)
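To make the second alternative concrete, here is one possible reading of "defer string output" as a small sketch. All names (`Deferred`, `wrapD`, `refWrapD`, `plain`) are hypothetical, and `String` stands in for `Doc`.

```haskell
-- A deferred element: rendering waits until the caller decides
-- whether an id should be attached.
type Deferred = Maybe String -> String

-- Partially apply tag and body; the optional id is supplied last.
wrapD :: String -> String -> Deferred
wrapD tag body mId = "<" ++ tag ++ idAttr ++ ">" ++ body ++ "</" ++ tag ++ ">"
  where
    idAttr = maybe "" (\i -> " id=\"" ++ i ++ "\"") mId

-- Supplying an id finishes rendering with the id folded in.
refWrapD :: String -> Deferred -> String
refWrapD ident el = el (Just ident)

-- Elements that are never referenced are rendered without an id.
plain :: Deferred -> String
plain el = el Nothing
```

In this sketch a deferred element cannot be inspected, so once it has been rendered there is no way to ask whether it already carries an id. That may be part of why this approach might not work in all cases.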
Definitely Alternative 3 is right out.
I mostly like this proposal. Before committing to it though, I would like to investigate if backing up one level would be useful. Right now the definition of the data-structure that the HTML backend takes as input is somewhat ad-hoc. Could it be modified so that HTML generation (with ids in the right places, etc) would be easier? What would that look like?
Another thing to look at is pandoc. We've been thinking of having more backends by going to pandoc instead. So before too much effort is spent on the HTML backend, I think it is worth taking a good close look.
LayoutObj
I do think some issues could be mitigated by improving LayoutObj. You mentioned last Tuesday that ALUR is a sort of "hack". As a result of this, both the TeX and HTML backends suffer. Both backends wrap each ALUR entirely in their own list. (See issue #470 for more info)
As for upper level improvements helping with "tag folding" in HTML, the references (and thus id attributes) are a part of LayoutObj and thus already known (and used) when serializing. The existing design breaks up code generation and ensures that an id is never added to an element which already has an id. There are two cases in the HTML backend where refWrap is not applied immediately to wrap. Both are wrappers for it, though, at a "shallow" level. (I misread the code initially, thinking more was done in these cases than just simple wrappers of wrap, so my second alternative should be as mentioned in the next paragraph instead.)
That makes it possible (in the current implementation) to add something like wrap' :: String -> String -> [String] -> Doc -> Doc. Additionally, in this function it would be possible to check for an empty id string and omit that as well. This may be a preferable "proposal" especially if there is (or likely to be) a plan to migrate to Pandoc.
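A minimal sketch of that id-aware `wrap'` follows. Several things here are assumptions: the argument order (tag, then id, then classes), the use of `String` as a stand-in for the backend's `Doc` type, and the exact attribute formatting.

```haskell
-- Assumption: String stands in for the backend's Doc type.
type Doc = String

-- Assumed argument order: tag, id, classes, body.
-- An empty id or an empty class list is omitted from the output.
wrap' :: String -> String -> [String] -> Doc -> Doc
wrap' tag ident classes body =
  "<" ++ tag ++ idAttr ++ classAttr ++ ">" ++ body ++ "</" ++ tag ++ ">"
  where
    idAttr
      | null ident = ""
      | otherwise  = " id=\"" ++ ident ++ "\""
    classAttr
      | null classes = ""
      | otherwise    = " class=\"" ++ unwords classes ++ "\""
```

Because the id is passed at the same time as the tag, the extra `<div>` is never needed, and no intermediate structure has to be introduced.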
Pandoc
While I haven't looked at it too much, Pandoc seems to support math by means of LaTeX literal math code (doesn't seem like a real problem, but definitely something to note). Another thing to note is that if we generate directly to the Pandoc AST, there doesn't appear to be a way to "modify" lists the way we do in both HTML and TeX. In TeX, itemize lists currently swap the bullet for prefixes. In HTML the bullets are hidden and the prefix is prepended to the content of the item. Again, this is not necessarily an issue and may be solved. But at an AST level it does not look like there is an option to hide the bullets in Pandoc.
Another thing to note is, while trying Pandoc on the (terminal) command line, it chokes within the dmath environment. Specifically in the sequence \int_ it doesn't expect _ and expects \end{dmath}. A quick skim through the generation of HTML looks like it may only support inline math by means of $$.
Ok, let's not rush to Pandoc then. We might still do it, but maybe as a 3rd printer rather than as the only one.
In the meantime, it is indeed best to fix the design of LayoutObj. It should really be a "language of abstract layout" that contains all the necessary information to translate to either html or LaTeX, but no 'semantics'. ALUR is all semantics. Take a look at Language.Drasil.Printing.AST as an example of a layout language (for math) that seems decent enough.
Note that pulling ALUR out will require some non-trivial design. We want to move some of the concepts (such as 'Assumption') out of Language.Drasil out to Data.Drasil. [That in itself will likely require that we split our package into several sub-packages. cabal doesn't like to have multiple libraries in a single file.]
This may not end up in scope for @Mornix to do, but I'll still re-assign. I think once ALUR is dealt with, the rest of LayoutObj and the rendering process can easily be modified as documented above.