ox icon indicating copy to clipboard operation
ox copied to clipboard

Parse error for comments containing special characters inside a DOCTYPE declaration

Open okeeblow opened this issue 3 years ago • 1 comments

Here's a corner case I ran into when trying to parse the chemical-mime project's XML:

irb(main):002:0> Ox::VERSION
=> "2.14.5"
irb(main):003:0> Ox::load_file('/home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/mime/packages/third-party/chemical-mime/chemical-mime-database.xml.in')
(irb):3:in `load_file': invalid format, dectype not terminated at line 1606, column 18 [parse.c:339] (Ox::ParseError)
        from (irb):3:in `<main>'
        from ./bin/repl:11:in `<main>'

That line 1606 is the end of the file, so the parser ran away searching for an end-symbol that never occurred. The trigger is an XML comment inside a <!DOCTYPE declaration when that comment contains one of the characters ", ', [, or <. In my case it was the less-than symbol seen right here on line 81: https://github.com/dleidert/chemical-mime/blob/4fd66e3b3b7d922555d1e25587908b036805c45b/src/chemical-mime-database.xml.in#L81

This does seems to be valid XML syntax according to my reading of the spec: "[comments] may appear within the document type declaration at places allowed by the grammar" https://www.w3.org/TR/REC-xml/#sec-comments

The cause within Ox is the handling of those four special characters in parse.c's and sax.c'sread_delimited (both versions) as called by read_doctype:

  • https://github.com/ohler55/ox/blob/337b0827b6843ba762c2dfa7af1c03d95bace0e6/ext/ox/parse.c#L341-L352
  • https://github.com/ohler55/ox/blob/5cc4bff6516c4cb35e7e2d7ace4348350dbfc1b2/ext/ox/sax.c#L572-L583

I narrowed it down to some very basic test cases. The simplest case of comment-inside-doctype without special characters does parse successfully but leaves the comment as plain text in the Ox::DocType's value:

irb(main):001:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment -->\n] l>")
=> 
#<Ox::Document:0x0000558c20da4380
 @attributes={:version=>"1.0", :encoding=>"UTF-8"},
 @nodes=[#<Ox::DocType:0x0000558c20da4268 @value="hey [\n<!-- this is a comment -->\n] l">]>

The other four all fail:

irb(main):002:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a < character -->\n] l>")
(irb):2:in `load': invalid format, dectype not terminated at line 4, column 8 [parse.c:339] (Ox::ParseError)
irb(main):003:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a \" character -->\n] l>")
(irb):3:in `load': invalid format, dectype not terminated at line 4, column 10 [parse.c:339] (Ox::ParseError)
irb(main):004:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a ' character -->\n] l>")
(irb):4:in `load': invalid format, dectype not terminated at line 4, column 10 [parse.c:339] (Ox::ParseError)
irb(main):005:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a [ character -->\n] l>")
(irb):5:in `load': invalid format, dectype not terminated at line 4, column 8 [parse.c:339] (Ox::ParseError)

I attempted to patch this myself but got a little lost trying to understand the interaction between the PInfo's position pointer and text value, and then very lost with the SAX parser's buffer checkpoint/checkback handling :p

For posterity here is as far as I got:

diff --git a/ext/ox/parse.c b/ext/ox/parse.c
index e9e28b3..0aa10ed 100644
--- a/ext/ox/parse.c
+++ b/ext/ox/parse.c
@@ -330,6 +330,11 @@ read_delimited(PInfo pi, char end) {
        }
     } else {
        while (1) {
+      if (0 == strncmp("<!--", pi->s, 4)) {
+    pi->s += 4;
+    read_comment(pi);
+    break;
+      }
            c = *pi->s++;
            if (end == c) {
                return;

Which fixed only the non-SAX parsing, still resulted in the comment text getting added to the DocType's value, and appended the new Comment to the top-level Document instead of to the DocType:

irb(main):002:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a < character -->\n] l>")
=>                                                                                       
#<Ox::Document:0x0000563eec630d90                                                                                                 
 @attributes={:version=>"1.0", :encoding=>"UTF-8"},                                                                                                    
 @nodes=                                                                                                                          
  [#<Ox::Comment:0x0000563eec630c28 @value="this is a comment containing a < character">,                                         
   #<Ox::DocType:0x0000563eec630bd8 @value="hey [\n<!-- this is a comment containing a < character">]>

For my purposes I don't actually need the contents of any of these doctype comments, just for the rest of the file to parse successfully with my SAX parser. Feel free to consider this one the lowest of low-priority though :)

okeeblow avatar Sep 02 '21 01:09 okeeblow

I'll take a look and get it fixed. It certainly is an edge case. I appreciate the detective work too.

ohler55 avatar Sep 02 '21 02:09 ohler55