libplist libplist's new XML parser has numerous bugs (even a segfault!)

OK, so, before I begin, I'm going to say that it has taken me hours to come to the position that trying to just fix this is even the right thing to do. I saw that libplist recently switched to a custom XML parser. I've written an XML parser before, and let me tell you that it isn't easy. When I did it, I had spent months deep in the various XML specifications, and I cared enough to really really handle them well; and even then, it was a learning exercise that I doubt ever worked 100% correctly and which I never considered using in production. The reality is that we already have access to a world class XML parser that tries really hard to do it right: libxml2.

However, I'm now taking this as an opportunity: it turns out that Apple also wrote a custom and broken implementation of XML parsing for their plist parser, so it is currently possible to build plist files which Apple parses correctly and which libplist doesn't. I actually consider this a serious problem, and while I think it is usually foolish to try to implement a custom XML parser, that is the only sane way to get bug-for-bug compatibility with Apple's parser. However, I'd like that to be made an explicit goal: to replicate the behavior of Apple's plist parser in libplist. Given that, it thereby makes sense to spend time trying to make this custom XML parser work.

In fact, by the end of this comment, I hope you will accept the argument that one actually can't correctly implement parsing the plist file format using a true XML parser.

So, let's begin.

Here is a valid plist file containing a copyright symbol. The new version of libplist fails to parse this correctly: if you run it through plistutil to convert it to binary and then run the result through plutil to print, you get an error from both sides.

<plist version="1.0">
<dict>
    <key>copyright</key>
    <string>&#169;</string>
</dict>
</plist>

invalid utf8 sequence in string at index 0
bin.plist: Unexpected character b at line 1

Here's a totally valid plist file which the new versions of libplist fails to correctly parse but which both the old version of libplist and Apple's parser worked with: note the carefully positioned whitespace inside of the closing tag.

<plist version="1.0">
<dict>
    <key>name</key >
    <string>value</string>
</dict>
</plist>

ERROR: Failed to convert input file.

Here's another perfectly valid plist file which generates that same output from the new parser.

<plist>
<dict>
    <key></key>
    <string>value</string>
</dict>
</plist>

We can make this one worse, though: let's add a comment; now it segfaults!

<plist>
<dict>
    <key><!-- test --></key>
    <string>value</string>
</dict>
</plist>

Segmentation fault: 11

BTW: while this file used to "work", if you run this through Apple's parser, you get the following error, as Apple actually doesn't allow comments inside of key values (something I never would have expected).

xml.plist: Encountered improper CDATA opening at line 3

Essentially, they try to parse all directives with <! as CDATA in this context. Here is an XML file which the current libplist supports, but which neither Apple nor the previous versions of libplist supported.

<plist>
<dict>
    <key>test<!test></key>
    <string>value</string>
</dict>
</plist>

Apple returns the CDATA error, while the new version of libplist provides the following weird result.

{
  "test>" => "value"
}

Continuing with this theme, here's another example of an invalid XML file which libplist previously did not support, which Apple also does not support, but which the new version of libplist incorrectly supports. I'm also including the error message from Apple.

<plist>
<dict>
    <key>test&amp</key>
    <string>value</string>
</dict>
</plist>

xml.plist: Encountered unknown ampersand-escape sequence at line 3

You have to be careful with how you fix this, though, because Apple's parser doesn't actually care about the validity of entity references inside of attributes. In fact, they aren't even trying to parse attributes, they seriously just skip to the next >. This is a valid plist file, according to Apple.

<plist Q=">
<dict>
    <key>test</key>
    <string>value</string>
</dict>
</plist>

Of course, this is clearly not a valid XML file, nor is this going to be easy to simulate using an XML parser: this is the example which proves "you can't correctly implement parsing the plist file format using a true XML parser". And this is then the point where these examples start getting awkward.

Here is a simple one: Apple actually allows text past the end of the file. libplist used to allow random strings in some places (which was wrong: Apple normally disallows extra text), but never supported it after the end of the file (as that would make the XML file itself invalid). The new version of libplist is now even more strict. But Apple actually does support text at the end of the file.

<plist>
<dict>
    <key>test</key>
    <string>value</string>
</dict>
</plist>
test

Here's another great one: Apple's plist parser doesn't even try to parse DOCTYPE values correctly. Clearly, libxml2 was really careful about them, and even this new version of libplist is trying to parse embedded DTDs, but Apple doesn't support any of this stuff. They don't even bother correctly parsing out string constants! The following XML file is supported by Apple, but libplist has always balked when parsing.

<!DOCTYPE test ">
<plist>
<dict>
    <key>test</key>
    <string>value</string>
</dict>
</plist>

So, while clearly I expect you to care about the segfault, and while I also totally believe you'll fix the cases where libplist can't parse something Apple does, I'm now going to really motivate why it is important to care about these quirks. Here is an XML file which both Apple and libplist can parse, but which uses parser differentials between the quirks of their separate implementations to get them to both end up with different results (note that in this case the old version of libplist, based on libxml2, rejects this file for being invalid XML).

<plist>
<dict>
    <key>test</key>
    <string =">apple</string>
    <!--<string ">libplist</string><!---->
</dict>
</plist>

When you parse this using Apple's parser, you get the following plist.

{
  "test" => "apple"
}

When you parse this using libplist's current parser, you get this different plist.

{
  "test" => "libplist"
}

:D

(edit:) I spent some more time at this, and have managed to come up with a version of this which lets me give different values to all three parsers: the old version of libplist backed by libxml2, the new version of libplist with the custom parser, and Apple's XML parser. Here I also rely on differences in behavior when there are duplicate key values. This isn't quite as awesome, as I am leaving a garbage key behind in the Apple plist (as the better tricks that can elide entire elements are noticed by libxml2), but usually extra keys are ignored ;P.

<plist>
<dict>
    <key>test</key>
    <string>libxml2</string>
    <key>test</key>
    <string>apple</string>
    <key Q=">">test</key>
    <string>libplist</string>
</dict>
</plist>

Apple:

{
  "">test" => "libplist"
  "test" => "apple"
}

old libplist (libxml2):

{
  "test" => "libxml2"
}

new libplist (custom):

{
  "test" => "libplist"
}

Jan 01 '17 02:01 saurik

Nice! Happy New Year :) I understand that libxml2 is more robust but is has some issues. It's leaking memory and also seems to have problems in multithreaded environments for some reason. That was the reason why I chose to do the custom parsing. See the improvements in the comments of this ticket: https://github.com/libimobiledevice/libimobiledevice/issues/394 I will take care of the issues. I will probably not be perfect in the end but it will be sufficient enough to parse plist data in the usual environments.

Jan 01 '17 03:01 nikias

@nikias Maybe my point was a little confusing (due to the attempt at cute rhetoric/story), but I was actually providing a much more robust reason why you can't use libxml2: Apple also implemented a custom XML parser, and some of the things they support are simply not going to be possible to simulate using libxml2. So, game on! If I have some time I'll be contributing patches to make this parser closely match the behavior I'm observing from Apple's list files. (If the issue really were just memory leaks or something, I'd rather go fix libxml2, but it turns out that libplist really needs a custom parser ;P.)

Jan 01 '17 03:01 saurik

(I was just talking about this with @DHowett, and he pointed out that Apple has changed the license on CFLite to MIT and open-sourced it all as part of Swift, so anyone who ends up here as part of tracking down a parser differential might consider trying to use Apple's code to try to get exactly the expected behavior.)

Jan 01 '17 03:01 saurik

meh, always killing my 0dayz... libxml2 and apple's development teams are on par for their issue response. it couldn't possibly get any worse!!

Jan 01 '17 03:01 posixninja

Ok perhaps a more constructive comment. can we perhaps create some test units? plist really isn't nearly as complicated as xml is. this surely could be made secure (even if not fully compliant with standards)

Jan 01 '17 06:01 posixninja

I just pushed several commits that fix the first 6 issues/examples in this list. @saurik I am not yet sure how to handle the other cases (text after the actual XML, unclosed attributes etc.) but I must really say the "libplist detection" cases are really nice haha :D @posixninja make check does some tests. In fact I am making a test file right now that tests the issues fixed in with latest commits.

Jan 02 '17 00:01 nikias

Added several test cases.

Jan 02 '17 01:01 nikias

Does Python's plistlib (Python 3+) have the same issues?

Jan 03 '17 15:01 Tatsh

plistlib uses expat XML parser so I doubt it has the same issues :)

Jan 03 '17 16:01 nikias

however it will probably have the Apple-only issues listed above.

Jan 03 '17 16:01 nikias

@nikias I use the multi-threaded environment seems to be no problem, Make some changes to the source code, I did not initialize and release multiple times

Feb 04 '17 06:02 xdeng

@nikias Can you separate the custom XML parser branch and libxml2 bugfixes branch ?

my English is not very good :(

Feb 04 '17 06:02 xdeng

@xdeng the custom xml parser works fine so I don't see any reason to switch back. There are some differences but it isn't really a problem for every day usage (like with libimobiledevice).

Feb 14 '17 17:02 nikias

@saurik just note, apple devices can create XML that contains null character inside element value, and it's like that since iOS 3 at least.

my xmlparser is based on expat library.

also, as for issues mentioned above, have not seen any of them, except not closed attribute, which out parser can take care off (it ignores attribute in fact)

May 01 '17 11:05 mexmer

I ran into a big plist file xml format 4x,xxx nodes file size 15MB Using old library parsing will take up very high memory 200MB I want to try the new library now.

Oct 30 '18 09:10 xdeng

libplist libplist copied to clipboard

libplist's new XML parser has numerous bugs (even a segfault!)

libplist
libplist copied to clipboard