Pedantic white space preservation not supported.

Open jodyp12 opened this issue 11 years ago • 6 comments

In the following XML, the text content, which happens to be a single space character, is stripped: <tspan font-weight="bold"> </tspan>

This happens when XMLNode::ParseDeep() calls XMLDocument::Identify(), which in turn calls XMLUtil::SkipWhiteSpace().

If any non-whitespace character follows the whitespace, Identify() correctly creates a text node and backs up to the first character, keeping the space along with the character that follows: <tspan font-weight="bold"> a</tspan>

The result is that whitespace is not fully preserved in text - which doesn't match the documentation. This example isn't just an exercise, it's an actual shipstopper when reading legacy files in a well known application that has been migrated to tinyxml2.
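
A simplified model of the behavior described above may make the two cases easier to compare. This is a sketch only, not the tinyxml2 source: classifyContent(), NodeKind and Identified are hypothetical names, and the input is assumed to be the character data between a start tag and the next '<'.

```cpp
#include <cctype>
#include <string>

// Simplified model of the reported behavior. NOT the tinyxml2 source;
// every name here is hypothetical and exists only for illustration.
enum class NodeKind { None, Text };

struct Identified {
    NodeKind kind;      // whether a text node would be created
    std::string text;   // the text that would be stored in it
};

// `content` is the character data between a start tag and the next '<'.
Identified classifyContent(const std::string& content) {
    std::size_t i = 0;
    // Step 1: skip leading whitespace, as SkipWhiteSpace() is reported to do.
    while (i < content.size() &&
           std::isspace(static_cast<unsigned char>(content[i]))) {
        ++i;
    }
    // Step 2: only whitespace before the next tag, so no text node is created.
    if (i == content.size()) {
        return {NodeKind::None, ""};
    }
    // Step 3: a non-whitespace character was found, so back up to the start
    // and keep the whole run, leading whitespace included.
    return {NodeKind::Text, content};
}
```

Under this model, a content of " " yields no text node, while " a" yields a text node holding " a", which matches the two <tspan> cases in this report.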

jodyp12 avatar Dec 11 '14 21:12 jodyp12

I'm sorry this isn't what you want. The behavior is documented in the README, although the specific case you have isn't clear. (The rule also is applied within an element, and the formatting example could be clearer that it includes both end of line normalization and whitespace.)

The best approach for TinyXML-2 has been discussed before, and this is a case where TinyXML-2 is intentionally choosing the generally more useful yet non-compliant behavior.

If you want to submit a pull request for a new behavior (PEDANTIC_WHITESPACE maybe?) it would be a worthwhile integration if it doesn't add too much code complexity.

leethomason avatar Dec 11 '14 23:12 leethomason

The following example shows that whitespace-only text is not preserved:

#include "tinyxml2.h"

using namespace tinyxml2;

int main( int argc, const char ** argv )
{
    // leading and trailing whitespace is preserved
    static const char* test1 = "<element>  leading and trailing whitespace   </element>";
    XMLDocument doc;
    doc.Parse( test1 );
    doc.Print();

    // whitespace only is not preserved !!
    static const char* test2 = "<element2>      </element2>";
    XMLDocument doc2;
    doc2.Parse( test2 );
    doc2.Print();

    return 0;
}

Gives output:

<element>  leading and trailing whitespace   </element>
<element2/>

peterbiglr avatar Mar 09 '15 19:03 peterbiglr

Leaving open in case someone wants to submit a patch for this. TinyXML2 is working as intended; it would need a new whitespace mode to fix.

leethomason avatar Mar 15 '15 23:03 leethomason

I agree that a new whitespace preservation option is needed, because legitimate HTML like this currently fails to be parsed as expected. This: <p><span class=\"class1\">formatted text with</span> <a href=\"\">link</a></p>

is printed as: <p><span class=\"class1\">formatted text with</span><a href=\"\">link</a></p> which is a loss of meaningful information: the space between the span and the link is gone.

I am trying to patch it myself, but so far I haven't managed to, because to work properly such a PEDANTIC_WHITESPACE option requires knowledge of the surrounding nodes (whitespace should be interpreted as text only if it is inside the <body> tag, not in the <head>).

petko avatar Apr 01 '15 10:04 petko

@ minimum, should support xml:space="preserve", as mentioned @ https://github.com/JayXon/Leanify/issues/3.
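
One way to picture the context knowledge that xml:space support would need is a small stack that follows the parser down the element tree. The sketch below is only an illustration: SpaceScope is a hypothetical helper, not part of tinyxml2, and it assumes the caller passes each element's xml:space attribute value (or an empty string when the attribute is absent). It relies on the XML rule that the attribute's value applies to an element's descendants until a descendant overrides it.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of tinyxml2: tracks the xml:space value in
// effect while walking down an element tree.
class SpaceScope {
public:
    // Call when entering an element. `attr` is that element's xml:space
    // attribute value, or an empty string if the attribute is absent.
    void enter(const std::string& attr) {
        const bool inherited = preserve();
        if (attr == "preserve") {
            stack_.push_back(true);
        } else if (attr == "default") {
            stack_.push_back(false);
        } else {
            // No (or unrecognized) attribute: inherit from the parent.
            stack_.push_back(inherited);
        }
    }

    // Call when leaving an element.
    void leave() {
        if (!stack_.empty()) {
            stack_.pop_back();
        }
    }

    // True if whitespace-only text should be kept at the current depth.
    bool preserve() const {
        return !stack_.empty() && stack_.back();
    }

private:
    std::vector<bool> stack_;  // one entry per open element
};
```

A parser using something like this would call enter() at each start tag, consult preserve() before discarding whitespace-only text, and call leave() at each end tag.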

TPS avatar Dec 27 '15 23:12 TPS

@leethomason @jodyp12 @peterbiglr @petko https://github.com/zeux/pugixml/issues/74 shows how https://github.com/zeux/pugixml has a mode that might be helpful to y'all, though it's not precisely xml:space="preserve" support.

TPS avatar Jan 10 '16 14:01 TPS

I've looked at this and created a few supporting unit tests. Latest pull request: https://github.com/leethomason/tinyxml2/pull/938

IMHO it is a problem only for some rare legacy systems such as ours: essential to some, but only in rare use-cases. As a result, rather than relying on the current whitespace options, I've created a new one called PRESERVERRAW_WHITESPACE. At present it treats only the space character as white space, which seems to be the only use-case for legacy systems.

So <element> </element> becomes " " if PRESERVERRAW_WHITESPACE is used. Otherwise, it will be "".

"<element>
</element>" is obviously still "". 

I didn't worry about <element> \r\n</element> because I haven't seen a need for this. My guess is it'll still show as space, but can't imagine a legacy system that is lazy enough to not put quotes would bother to put a CrLf.

TangataRereke avatar May 11 '23 02:05 TangataRereke