xmlm
Improper indentation of text nodes
Indenting text nodes introduces non-ignorable whitespace, which makes indentation almost useless except for XML structures that contain only elements and attributes.
Used indentation settings:
Xmlm.make_output ~decl:true ~nl:true ~indent:(Some 2) (`Buffer buffer)
Expected output:
<?xml version="1.0" encoding="UTF-8"?>
<test a="1">
  <test2 b="2"/>
  <test3 c="3">Something</test3>
</test>
Actual output:
<?xml version="1.0" encoding="UTF-8"?>
<test a="1">
  <test2 b="2"/>
  <test3 c="3">
    Something
  </test3>
</test>
The situation is even worse if "Something" is produced by two `Data signals, "Some" and "thing", in which case additional whitespace is added between the text nodes:
<?xml version="1.0" encoding="UTF-8"?>
<test a="1">
  <test2 b="2"/>
  <test3 c="3">
    Some
    thing
  </test3>
</test>
Note 1:
I'm aware that this behavior is exactly as documented:
If indent is Some c, each Xmlm.signal is output on its own line
Nevertheless, I'd like to point out that this behavior doesn't make a lot of sense for text nodes: it means that the only correct way of outputting a large text block is to emit it as one single large string, which defeats some of the benefits of a streaming interface.
Instead, I'd propose as a general rule of thumb that multiple Data signals should always result in exactly the same output as one single combined Data signal.
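The rule of thumb can be stated as a property that is easy to test: for any printer p, p signals should equal p (coalesce_data signals). Below is a minimal sketch of such a normalization. The name coalesce_data is hypothetical and not part of Xmlm, and the signal type is redefined locally (with the same shape as Xmlm.signal) so that the snippet compiles without the library. It works on lists and is meant for stating or checking the property, not as a replacement for streaming output.

```ocaml
(* Local definitions with the same shape as Xmlm's signal type (assumption:
   redefined here only so the sketch is self-contained). *)
type name = string * string
type attribute = name * string
type tag = name * attribute list
type signal =
  [ `Dtd of string option | `El_start of tag | `El_end | `Data of string ]

(* Merge every run of consecutive `Data signals into a single `Data signal.
   All other signals are passed through unchanged and in order. *)
let coalesce_data (signals : signal list) : signal list =
  (* [pending] holds the text parts of the current run, most recent first. *)
  let flush pending acc =
    match pending with
    | [] -> acc
    | parts -> `Data (String.concat "" (List.rev parts)) :: acc
  in
  let rec go pending acc = function
    | [] -> List.rev (flush pending acc)
    | `Data s :: rest -> go (s :: pending) acc rest
    | other :: rest -> go [] (other :: flush pending acc) rest
  in
  go [] [] signals
```

With this, the "Some" / "thing" case above normalizes to the same signal sequence as a single "Something", so a printer that follows the rule would have to produce identical output for both.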
Note 2:
I'm also aware that the XML spec is pretty vague with regard to whitespace. But from experience and the best practices that have emerged over the last 20 years, I'd like to state the following principles about XML formatting (I'm pretty sure there is a formal statement of these somewhere, i.e. some consensus on this "ignorable whitespace" topic, but I can't find it right now):
- It is fine to not indent attributes at all. (But if you do, there is some disagreement about whether to indent those by one or two levels.)
- If an element node contains only sub element nodes but no text nodes, indent those by one level.
- If an element node contains only sub text nodes, do not indent and do not add any whitespace.
- If an element node is mixed, i.e. contains sub elements as well as sub text nodes, do not indent and do not add any whitespace for the whole subtree (i.e. also not for nested elements).
Example 1:
<foo a="1">
  <bar b="2"/>
  <bar c="3" d="4">Something else</bar>
  <bar c="3">
    <foo c="5" d="7"><x>So</x><x>me</x> mixed<y z="1"><z> con</z><z>tents</z></y></foo>
  </bar>
  <bar c="3">Something</bar>
</foo>
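When the whole document is available as a tree, the four principles above are straightforward to implement, because an element's children can be inspected before anything is printed. The following is a minimal sketch under that assumption. The tree type and the functions flat, pretty and to_string are hypothetical names for illustration, not part of Xmlm, and namespaces are ignored.

```ocaml
type tree =
  | El of string * (string * string) list * tree list (* name, attrs, children *)
  | Text of string

let is_text = function Text _ -> true | El _ -> false

(* Escape characters that are special in character data and in
   double-quoted attribute values. *)
let escape s =
  let b = Buffer.create (String.length s) in
  String.iter
    (function
      | '<' -> Buffer.add_string b "&lt;"
      | '>' -> Buffer.add_string b "&gt;"
      | '&' -> Buffer.add_string b "&amp;"
      | '"' -> Buffer.add_string b "&quot;"
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

(* Attributes are never indented: they stay on the line of the start tag. *)
let open_tag b name attrs ~empty =
  Buffer.add_char b '<';
  Buffer.add_string b name;
  List.iter (fun (k, v) -> Printf.bprintf b " %s=\"%s\"" k (escape v)) attrs;
  Buffer.add_string b (if empty then "/>" else ">")

(* Print a subtree with no added whitespace at all. Used for empty,
   text-only and mixed elements, including everything nested in them. *)
let rec flat b = function
  | Text s -> Buffer.add_string b (escape s)
  | El (name, attrs, []) -> open_tag b name attrs ~empty:true
  | El (name, attrs, kids) ->
      open_tag b name attrs ~empty:false;
      List.iter (flat b) kids;
      Printf.bprintf b "</%s>" name

(* Print a node on its own line. Children are indented by one level only
   if the element has children and none of them is a text node. *)
let rec pretty b ~step depth node =
  let pad = String.make (depth * step) ' ' in
  Buffer.add_string b pad;
  (match node with
   | El (name, attrs, (_ :: _ as kids)) when not (List.exists is_text kids) ->
       open_tag b name attrs ~empty:false;
       Buffer.add_char b '\n';
       List.iter (pretty b ~step (depth + 1)) kids;
       Buffer.add_string b pad;
       Printf.bprintf b "</%s>" name
   | other -> flat b other);
  Buffer.add_char b '\n'

let to_string ?(step = 2) tree =
  let b = Buffer.create 256 in
  pretty b ~step 0 tree;
  Buffer.contents b
```

Applied to a tree with the structure of Example 1, to_string should reproduce that layout. The decision for each element depends on all of its children, which is exactly the information a streaming printer does not have when the start tag is emitted.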
I'm aware that the last principle is very hard to achieve with streaming interfaces. On the other hand, also note that this affects essentially only DocBook and XHTML, and mixed elements are practically never used in any other XML formats. Nevertheless, there are two common solutions for that:
- Either adjust the streaming interface to distinguish between normal and mixed contents (and enforce that).
- Or indent all sub elements until the first appearance of a text node, then don't indent or add any other whitespace for the whole remaining subtree.
Example 2:
<foo a="1">
  <bar b="2"/>
  <bar c="3" d="4">Something else</bar>
  <bar c="3">
    <foo c="5" d="7">
      <x>So</x>
      <x>me</x> mixed<y z="1"><z> con</z><z>tents</z></y></foo>
  </bar>
  <bar c="3">Something</bar>
</foo>
Note 3:
Finally, I'd like to mention that when in doubt about XML formatting, it is always a good idea to compare with the output of xmllint --format (the command line tool of the libxml2 library). That tool does many things right and is widely used, so it has received a lot of attention over time to match real-world needs. Also note that other XML tools and libraries, such as Xerces, handle whitespace in roughly the same way.
Yes, generic XML is broken w.r.t. whitespace (or rather people use XML for the wrong purpose), which makes it difficult to devise generic pretty-printing routines.
What are you suggesting exactly?
I'd suggest changing the Xmlm indentation to work as described above.
I believe the simplest way from Xmlm's current state to there is to implement the following changes:
- Never indent text nodes.
- When the first text node within an element is encountered, disable indentation until that element is closed.
- As a consequence, all remaining subelements and their subelements, i.e. the whole remaining subtree of that element, are not indented. This is a good thing.
- In other words, implement Example 2 and not Example 1.
Does that help, or did I leave any important special cases unspecified?
For compatibility reasons it's unlikely I'm going to change the current behaviour.
What you propose is different, but I'm not necessarily convinced it's so much better in general. Again, it's difficult to devise a generic pretty printing for generic XML without domain specific knowledge.
In any case, I want to recall that, as mentioned in the tips, you can always implement an arbitrary pretty-printing behaviour suitable for your data format by using ~indent:None and appropriate Data signals. That, in my opinion, is the best strategy to follow if you are unhappy about indent's strategy.
Maybe you want to start implementing your strategy that way as a stateful stream transformer.
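A minimal sketch of such a stateful stream transformer, following the strategy proposed earlier in this thread (never indent text nodes, and stop indenting inside an element once its first text node has been seen), could look like this. The name indenter, the frame record and the step parameter are hypothetical and not part of Xmlm. The signal type is redefined locally with the same shape as Xmlm.signal so that the snippet compiles on its own. It assumes OCaml 4.08 or later for Stack.top_opt and Stack.pop_opt.

```ocaml
type name = string * string
type attribute = name * string
type tag = name * attribute list
type signal =
  [ `Dtd of string option | `El_start of tag | `El_end | `Data of string ]

(* One frame per currently open element. *)
type frame = {
  mutable verbatim : bool; (* text seen here or in an ancestor: add nothing *)
  mutable broke : bool;    (* a child element was put on its own line *)
}

(* [indenter emit] returns a function that forwards every signal to [emit],
   inserting whitespace-only `Data signals where indentation is wanted. *)
let indenter ?(step = 2) (emit : signal -> unit) : signal -> unit =
  let stack : frame Stack.t = Stack.create () in
  let newline depth = emit (`Data ("\n" ^ String.make (depth * step) ' ')) in
  function
  | `Dtd _ as s -> emit s
  | `Data _ as s ->
      (* First text node in the open element: disable indentation for the
         rest of that element's subtree. *)
      (match Stack.top_opt stack with
       | Some f -> f.verbatim <- true
       | None -> ());
      emit s
  | `El_start _ as s ->
      let verbatim =
        match Stack.top_opt stack with
        | None -> false (* root element: nothing to indent against *)
        | Some parent ->
            if not parent.verbatim then begin
              newline (Stack.length stack);
              parent.broke <- true
            end;
            parent.verbatim (* children inherit the parent's mode *)
      in
      Stack.push { verbatim; broke = false } stack;
      emit s
  | `El_end as s ->
      (* Put the end tag on its own line only if children were indented
         and no text node appeared in the element. *)
      (match Stack.pop_opt stack with
       | Some f when f.broke && not f.verbatim -> newline (Stack.length stack)
       | _ -> ());
      emit s
```

Because the local signal type is a polymorphic variant with the same constructors as Xmlm.signal, it should be possible to pass Xmlm.output o as emit, with o created by Xmlm.make_output ~indent:None. Fed with the signals of Example 2, this transformer is intended to produce that layout: the two x elements are still indented, and everything after " mixed" is written without added whitespace.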
Thanks for the hint. This is indeed what we already did to solve the indentation issue.
The purpose of the above detailed explanation about whitespacing in XML was to contribute something back to Xmlm.
In case you reconsider improving the indentation behavior of Xmlm, I'd like to propose an additional optional argument. This argument could be:
- a boolean flag to disable indentation of text/mixed nodes (defaulting to false) for backwards compatibility,
- or, if a more general solution is preferred, it could be a polymorphic variant argument to select one of multiple indentation strategies.