tinyxml2 icon indicating copy to clipboard operation
tinyxml2 copied to clipboard

Parsing fails because of processing instruction

Open andiwand opened this issue 6 years ago • 6 comments

Test XML:

<?xml version="1.0" encoding="UTF-8"?>
<!--that was a xml declaration-->
<root>
	<?pi_target pi_content?>
	<!--that was a processing instruction-->
</root>

Can be validated. (https://codebeautify.org/xmlvalidator)

Test code:

#include <cassert>
#include "tinyxml2.h"

int main() {
    tinyxml2::XMLDocument doc;
    tinyxml2::XMLError error = doc.Parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
              "<!--that was a xml declaration-->"
              "<root>"
              "<?pi_target pi_content?>"
              "<!--that was a processing instruction-->"
              "</root>");
    // valid according to https://codebeautify.org/xmlvalidator

    assert(error == tinyxml2::XMLError::XML_SUCCESS);
    assert(doc.FirstChild()->ToDeclaration() != nullptr);
    assert(doc.FirstChild()->NextSibling()->ToComment() != nullptr);
    assert(doc.FirstChild()->NextSibling()->NextSibling()->ToElement() != nullptr);
    //assert(doc.FirstChild()->NextSibling()->NextSibling()->ToElement()->FirstChild()->ToUnknown() != nullptr);

    return 0;
}

Hotfix: https://github.com/leethomason/tinyxml2/compare/master...andiwand:hotfix/processing-instruction

References:

  • https://en.wikipedia.org/wiki/Processing_Instruction
  • https://www.w3.org/TR/REC-xml/#sec-pi

andiwand avatar Oct 09 '19 09:10 andiwand

hi @andiwand

#include <cassert>
#include "tinyxml2.h"

int main() {
    XMLDocument doc;
    doc.Parse("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
              "<!--that was a xml declaration-->"
              "<root>"
    //          "<?pi_target pi_content?>"
              "<!--that was a processing instruction-->"
              "</root>");
printf("%d",doc.ErrorID()); ----success:doc.ErrorID()=0,fail:doc.ErrorID()=11

the situation can parse success When there is a declaration inside the node, the xml parsing fails. I personally think this may be a bug, because when I use libxml2, this situation can be successfully parsed.

Alanscut avatar Oct 14 '19 07:10 Alanscut

hi @Alanscut ! the problem is that tinyxml2 classifies the statement <?pi_target pi_content?> as an xml declaration but it is not, it is a processing instruction and valid xml. this is a bug because there is no distinction between those two things. in my case i cannot simply remove the processing instruction from the input because i read it from a file.

andiwand avatar Oct 14 '19 07:10 andiwand

Hi, Thx for the posts. I ran into the same issue. Here is a simple proposal for a fix in tinyxml2.cpp function XMLDocument::Identify May be somebody can validate it and include the changes into the next update.

` // These strings define the matching patterns: static const char* xmlHeader = { "<?xml" }; static const char* ProcessingInstructionHeader = { "<?" }; static const char* commentHeader = { "<!--" }; static const char* cdataHeader = { "<![CDATA[" }; static const char* dtdHeader = { "<!" }; static const char* elementHeader = { "<" }; // and a header for everything else; check last.

static const int xmlHeaderLen		= 5;						
static const int ProcessingInstructionHeaderLen	= 2;		
static const int commentHeaderLen	= 4;
static const int cdataHeaderLen		= 9;
static const int dtdHeaderLen		= 2;
static const int elementHeaderLen	= 1;

TIXMLASSERT( sizeof( XMLComment ) == sizeof( XMLUnknown ) );		// use same memory pool
TIXMLASSERT( sizeof( XMLComment ) == sizeof( XMLDeclaration ) );	// use same memory pool
XMLNode* returnNode = 0;
if ( XMLUtil::StringEqual( p, xmlHeader, xmlHeaderLen ) ) {
    returnNode = CreateUnlinkedNode<XMLDeclaration>( _commentPool );
    returnNode->_parseLineNum = _parseCurLineNum;
    p += xmlHeaderLen;
}
else if ( XMLUtil::StringEqual( p, ProcessingInstructionHeader, ProcessingInstructionHeaderLen ) ) {
    returnNode = CreateUnlinkedNode<XMLUnknown>( _commentPool );	// it needs to be XMLUnknown		
    returnNode->_parseLineNum = _parseCurLineNum;													
    p += ProcessingInstructionHeaderLen;															
}

`

ReinholdH avatar Aug 31 '20 18:08 ReinholdH

Just want to add to my proposal that according to the spec for processing instructions https://www.w3.org/TR/REC-xml/#sec-pi static const char* xmlHeader = { "<?xml" }; static const char* xmlHeader = { "<?XML" }; XML in capital letters needs to be checked, too.

ReinholdH avatar Sep 04 '20 07:09 ReinholdH

Well, according to the spec, if I read it correctly, PI which is XML declaration must be only ?xml (lowercase), but in other Processing instructions names like xml and XML, and mixed case like xMl XmL etc are disallowed.

Hence it is enough to check only against <?xml

EDIT: link to xml decl: https://www.w3.org/TR/REC-xml/#sec-prolog-dtd

omgtehlion avatar Sep 04 '20 09:09 omgtehlion