quick-xml icon indicating copy to clipboard operation
quick-xml copied to clipboard

Declared entity is not recognized

Open pchampin opened this issue 4 years ago • 10 comments

When a file defines new entities into its DOCTYPE, quick-xml does not recognize these entities in attribute values or text nodes. This causes an EscapeError(UnrecognizedSymbol(...)) error.

I built a minimal example demonstrating this problem...

use quick_xml::Reader;
use quick_xml::events::Event;

const DATA: &str = r#"

    <?xml version="1.0"?>
    <!DOCTYPE test [
    <!ENTITY msg "hello world" >
    ]>
    <test label="&msg;">&msg;</test>

"#;

fn main() {
    
    let mut reader = Reader::from_str(DATA);
    reader.trim_text(true);
    
    let mut buf = Vec::new();
    
    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Start(ref e)) => {
                match e.name() {
                    b"test" => println!("attributes values: {:?}",
                                        e.attributes().map(|a| a.unwrap().unescape_and_decode_value(&reader).unwrap())
                                        .collect::<Vec<_>>()),
                    _ => (),
                }
            },
            Ok(Event::Text(ref e)) => {
                println!("text value: {}", e.unescape_and_decode(&reader).unwrap());
            },
            Ok(Event::Eof) => break,
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            _ => (),
        }
    }
}

I wasn't sure if that was a dublicate of #124 because that issue is about external entities...

pchampin avatar Feb 05 '21 09:02 pchampin

If I understand correctly, quick-xml does not intend to parse the content of the DOCTYPE, which is fair enough.

I would be happy with a (hopefully simpler) feature to add new entities to the parser. That way, I could:

  • subscribe to the DOCTYPE event
  • make my custom parsing of its content to extract only local entities (that's really just what I need)
  • inform the Reader about those entities so that it does not choke on their use in the rest of the file.

pchampin avatar Feb 05 '21 14:02 pchampin

Indeed quick-xml doesn't support parsing DOCTYPE for the moment. I'd be happy to review a PR but I believe this is not a trivial change.

One way you could work around it is to:

  • get the DocType text event and parse the entities into a hashmap
  • when getting the attributes, don't try to unescape values, instead search in your map first then only unescape (using Attribute::value field and the unsecape function)) Alternatively if you feel like you could improve quick-xml I'd accept any change, eventually behing a feature flag if needed.

tafia avatar Feb 08 '21 04:02 tafia

Great minds think alike: during the weekend, I came up with a similar solution (see #261). The main difference is that my PR extends the API with unescape_..._with_custom_entities methods, where the user can pass the map built from parsing the DOCTYPE.

pchampin avatar Feb 08 '21 09:02 pchampin

I'm running into the same issue, is there any plan to fix this? Or a workaround that works when using serde deserialisation rather than the event API directly?

qwandor avatar Feb 10 '22 22:02 qwandor

@qwandor my PR above was merged, so the workaround described above can be used.

For Serde, however, I don't know...

pchampin avatar Feb 11 '22 13:02 pchampin

For some complex situations _with_custom_entities won't be enough.

Example: as per the specificiation, entity replacement in text contents can impact XML parsing:

If the DTD contains the declaration

<!ENTITY example "<p>An ampersand (&#38;) may be escaped numerically (&#38;#38;) or with a general entity (&amp;).</p>" >

then the XML processor will recognize the character references when it parses the entity declaration, and resolve them before storing the following string as the value of the entity " example ":

<p>An ampersand (&) may be escaped numerically (&#38;) or with a general entity (&amp;).</p>

A reference in the document to " &example; " will cause the text to be reparsed, at which time the start- and end-tags of the p element will be recognized and the three references will be recognized and expanded, resulting in a p element with the following content (all data, no delimiters or markup):

An ampersand (&) may be escaped numerically (&) or with a general entity (&).

Architecturally that would be rather difficult to implement as things currently stand.

However: this same "feature" is responsible for a lot of denial of service attacks and security issues, so maybe we don't want to ever implement that.

https://en.wikipedia.org/wiki/Billion_laughs_attack https://knowledge-base.secureflag.com/vulnerabilities/xml_injection/xml_entity_expansion_vulnerability.html

dralley avatar Sep 26 '22 02:09 dralley

@dralley _with_custom_entities is not meant as a general solution to the problem, but as a workaround for some common practices in the RDF/XML (see https://github.com/pchampin/sophia_rs/issues/98).

so maybe we don't want to ever implement that.

Indeed, I believe that it is one of the reasons why quick-xml does not parse the DOCTYPE in the first place.

pchampin avatar Sep 26 '22 12:09 pchampin

Probably, we want to do more robust parsing of DTD, because currently we fail to parse this (because we simply count < and > inside DOCTYPE definition to know where DTD is ended):

<!DOCTYPE e [
    <!ELEMENT e ANY>
    <!ATTLIST a>
    <!ENTITY ent '>'>
    <!NOTATION n SYSTEM '>'>
    <?pi >?>
    <!-->-->
]>
<e/>

According to the https://www.truugo.com/xml_validator/, this is valid XML document. I also created a pegjs grammar by adapting rules from https://www.w3.org/TR/xml11, and it accepts this DTD.

PEGjs DTD grammar
// DTD parser from https://www.w3.org/TR/xml11
// rule S renamed to _, S? replaced by __
doctype = '<!DOCTYPE' _ Name (_ @ExternalID)? __ ('[' @intSubset ']' __)? '>';
ExternalID
  = 'SYSTEM' _ SystemLiteral
  / 'PUBLIC' _ PubidLiteral _ SystemLiteral;
SystemLiteral
  = '"' @$[^"]* '"'
  / "'" @$[^']* "'"
  ;
PubidLiteral
  = '"' @$PubidChar* '"'
  / "'" @$(!"'" PubidChar)* "'"
PubidChar = [- \r\na-zA-Z0-9'()+,./:=?;!*#@$_%];

intSubset = (markupdecl / DeclSep)*;
markupdecl
  = elementdecl
  / AttlistDecl
  / EntityDecl
  / NotationDecl
  / PI
  / Comment
  ;
elementdecl = '<!ELEMENT' _ Name _ contentspec __ '>';
contentspec = 'EMPTY' / 'ANY' / Mixed / children;
Mixed
  = '(' __ '#PCDATA' (__ '|' __ @Name)* __ ')*'
  / '(' __ '#PCDATA' __ ')'
  ;
children = (choice / seq) ('?' / '*' / '+')?;
choice = '(' __ cp (__ '|' __ @cp)+ __ ')';
seq    = '(' __ cp (__ ',' __ @cp)* __ ')';
cp = (Name / choice / seq) ('?' / '*' / '+')?;

AttlistDecl = '<!ATTLIST' _ Name AttDef* __ '>';
AttDef = _ Name _ AttType _ DefaultDecl;
AttType = StringType / TokenizedType / EnumeratedType;
StringType = 'CDATA'
TokenizedType
  = 'IDREFS'
  / 'IDREF'
  / 'ID'
  / 'ENTITY'
  / 'ENTITIES'
  / 'NMTOKENS'
  / 'NMTOKEN'
  ;
EnumeratedType = NotationType / Enumeration;
NotationType = 'NOTATION' _ '(' __ Name (__ '|' __ @Name)* __ ')';
Enumeration = '(' __ Nmtoken (__ '|' __ @Nmtoken)* __ ')';
DefaultDecl
  = '#REQUIRED'
  / '#IMPLIED'
  / ('#FIXED' _)? AttValue
  ;
AttValue
  = '"' @([^<&"] / Reference)* '"'
  / "'" @([^<&'] / Reference)* "'"
  ;
Reference = EntityRef / CharRef;
EntityRef = '&' Name ';'
CharRef
  = '&#' $[0-9]+ ';'
  / '&#x' $[0-9a-fA-F]+ ';'
  ;

EntityDecl = GEDecl / PEDecl;
GEDecl = '<!ENTITY' _ Name _ EntityDef __ '>';
PEDecl = '<!ENTITY' _ '%' _ Name _ PEDef __ '>';
EntityDef = EntityValue / (ExternalID NDataDecl?);
PEDef = EntityValue / ExternalID;
EntityValue
  = '"' @([^%&"] / PEReference / Reference)* '"'
  / "'" @([^%&'] / PEReference / Reference)* "'"
  ;
NDataDecl = _ 'NDATA' _ Name;

NotationDecl = '<!NOTATION' _ Name _ (ExternalID / PublicID) __ '>';
PublicID = 'PUBLIC' _ PubidLiteral;

PI = '<?' PITarget (_ @$(!'?>' Char)*)? '?>';
PITarget = !'xml'i @Name;

Comment = '<!--' $((!'-' Char) / ('-' (!'-' Char)))* '-->';

DeclSep = PEReference / _;
PEReference = '%' Name ';';

_ = $[ \t\r\n]+;
__ = $[ \t\r\n]*;

Name = $(NameStartChar NameChar*);
Nmtoken = $NameChar+;

Char = [\u0001-\uD7FF\uE000-\uFFFD] // / [\u10000-\u10FFFF];
NameStartChar = [:A-Za-z_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD] // / [\u10000-\uEFFFF]
NameChar = NameStartChar / [-.0-9] / '\xB7' / [\u0300-\u036F] / [\u203F-\u2040];

Mingun avatar Jun 11 '23 20:06 Mingun

Related to dealing with files that break parsing because of this issue:

Is there a way to decode a BytesText into an str without unescaping at all?

melissaboiko avatar May 02 '24 16:05 melissaboiko