pugixml
                                
                                 pugixml copied to clipboard
                                
                                    pugixml copied to clipboard
                            
                            
                            
                        Need help rejecting invalid entity expansions
The classic billion laughs attack relies on entity declarations repeatedly expanding entities. I'm glad you can't do that, but I would like to have you give an error on the entity that didn't get expanded. That is, with this snippet:
   :
 <!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;">
 <!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;">
 <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">]>
<lolz>&lol9;</lolz>
I would like you to report an error on &lol9;. Your manual says "... so you can do additional processing later." I cannot detect this error case from someone deliberately writing &lol9;. That would expand to look like the error case, but it never intended to expand lol9.
Reporting this error won't even slow down the parser in the normal case. In strconv_escape all the good cases return. All the error cases break, and end up at the bottom of the function, with return stre;. All you need to do is give an error there, and you've got it.
(You can't give an error in strconv_escape, or in the function that calls it, because you don't actually throw exceptions, but set variables in the class and return the pointer to where the error was. I made the strconv_escape return the pointer to the '&', and I made the caller and caller's caller check if *s=='&' and give an error. That's lame, and someone else could probably do better.)
I cannot detect this error case from someone deliberately writing &lol9;
If you want to perform custom entity expansion yourself, you need to disable parse_escapes during parsing. After this you're free to perform your own unescaping with arbitrary logic, including error handling etc.
I'm perfectly happy with this not being expanded, it's just that Pugi should give an error that it didn't do it.
If course I could turn the option off, and I can write my own xml parser. I'm just saying that it is really easy for you to detect the problem in the source. When you get to the code to expand entities, if you don't find an entity to expand, you just add (effectively) else if the option to report errors on bad expansions is set, then report an error here
There would be absolutely no cost in the happy path, and the error handling would be significantly better. It doesn't matter which you consider default and which is optional. (well, of course the default should be the current behavior.)
strconv_escape doesn't have the context pointer of any kind. For it to store the error, it would need to get an extra pointer that needs to be propagated all the way through all other functions used for all decoding, which adds overhead.
This, combined with the fact that it's a backwards incompatible change that will definitely make some existing files unparseable, makes it a not so easy change. I'm not even sure that the defaults should be to reject such files since that would mean that pugixml can't parse valid XML files that depend on external entities to be parsed in full.
Yeah, that strconv_escape can't directly emit errors makes it harder. I got it to work by making a bogus return value that had to be tested by the caller, and finally emitted by the caller's caller. That stunk. I assumed you'd see a more clever way to be able to report an error.
Of course if it could be reported, it should only be by an option, by default off, so that all existing usage is unchanged.
I worked around this defect by scanning the Xml before giving it to Pugi, and emitting a warning if there were any entity definitions, saying "if you try to use this, it won't work and your entity will be left in the text." It's lame, but it's the best I can do. For my purposes I do not need the expansions, so it's enough.