epub-specs icon indicating copy to clipboard operation
epub-specs copied to clipboard

Can a conforming RS reject EPUBs that reference external entities?

Open bduga opened this issue 2 years ago • 20 comments

In the RS spec we say:

In addition, when processing XML documents, a reading system MUST NOT resolve 
[external identifiers](https://www.w3.org/TR/xml/#NT-ExternalID) in DOCTYPE, 
ENTITY and NOTATION declarations [[xml](https://www.w3.org/TR/epub-rs-33/#bib-xml)].

However, EPUB check will report an error for such content. For instance, when run against the test for this it will produce this error:

EPUB/content_001.xhtml:4, 36: External entities are not allowed in EPUB v3 documents. 
   External entity declaration found: xxe.
EPUB/foo.xhtml:1, 7: Error while parsing file: element "span" not allowed here; 
   expected element "html" (with xmlns="http://www.w3.org/1999/xhtml")

It is unclear if a RS that rejects this content is conformant. Must it process the document and display the &xxe; in text? There may be other cases where EPUB check reports an error but the spec gives guidance on how to process the document. For content that comes from a publisher, it seems like failing is the best thing to do, as it forces a fix to the content. Should we add some sort of blanket statement that allows an RS to reject such content? And do we need to specify that that is an acceptable outcome for the tests?

bduga avatar Sep 15 '22 21:09 bduga

More similar cases can be found in the following tests:

  • pkg-spine-duplicate-item-hyperlink
  • pkg-spine-duplicate-item-rendering
  • pkg-spine-duplicate-item-ui

bduga avatar Sep 15 '22 21:09 bduga

Doesn't this lead us into defining what it meas to reject content: won't load the publication into the user's library; won't open the publication; won't open the offending document; strips the offending markup; something else?

I don't see why you can't choose any of these options instead for any issue, except that it would be hard to also call such an implementation conforming. Does it lead to a reading system that refuses to open a publication without a title being conforming, for example?

mattgarrish avatar Sep 16 '22 14:09 mattgarrish

Doesn't this lead us into defining what it meas to reject content: won't load the publication into the user's library; won't open the publication; won't open the offending document; strips the offending markup; something else?

Since EPUB is largely used for interchange, I think it is a wholesale rejection of the EPUB when supplied to the Reading System. It would be rejected when a publisher/distributor supplies it to a RS or (possibly) when a user attempts to load it in their RS (up/side load).

I don't see why you can't choose any of these options instead for any issue, except that it would be hard to also call such an implementation conforming. Does it lead to a reading system that refuses to open a publication without a title being conforming, for example?

Sure, I think it is a reasonable question for other cases as well. The issue is the optimal behavior here will depend on circumstances. So for a user uploaded EPUB it may make sense to do our best to display the content with the unexpanded reference (which I think is the current requirement). However, when we get content from a publisher this is almost certainly an error they can fix - ignoring it and allowing the content to enter our system and be sold to customers is a poor experience. Instead we decline it and inform the publisher, so we get the correct rendering to clients. But that in turn makes us non-compliant.

bduga avatar Sep 16 '22 18:09 bduga

In addition, when processing XML documents, a reading system MUST NOT resolve external identifiers in DOCTYPE, ENTITY and NOTATION declarations [xml].

MathML documents containing entities such as ‾ are disallowed by this sentence. Is this OK?

murata2makoto avatar Sep 23 '22 10:09 murata2makoto

I cannot really judge the details of the test (my XML knowledge on such esoteric features as external entities is rusty; @dauwhe should be the judge of the tests itself), I believe there is a more general issue. Namely, that there may be several tests that do not pass epubcheck (and for good reasons, like the one in this example). But, from a testing point of view, that is besides the point; the issue/question is whether the RS reacts as it should in case it does get into a situation to render such faulty document.

iherman avatar Sep 23 '22 10:09 iherman

@iherman

It is torture time even for me. (FYI: I was a member of the original XML WG.)

But I would like to first know if disallowing MathML entities is acceptable for MathML users. Did they already switch to character references or Unicode characters?

murata2makoto avatar Sep 23 '22 11:09 murata2makoto

@iherman

It is torture time even for me. (FYI: I was a member of the original XML WG.)

😀

But I would like to first know if disallowing MathML entities is acceptable for MathML users. Did they already switch to character references or Unicode characters?

I do not think we have any data on that. However, the spec itself refers to MathML3, so specification-wise I believe it is fine to keep this restriction. A RS may decide to ignore it for the purpose of old MathML entities if there is a market need for it, but that should not be (in my view) part of the spec.

iherman avatar Sep 23 '22 12:09 iherman

Assuming that we do not need MathML entities, here is my opinion.

First, what XML processors do and what RSes do are different. In particular, we can make RSes reject the output of XML processors or even refuse to invoke XML processors if there are good reasons.

Second, when we disallow some features of XML in EPUB 3.3, I do not think that we have to say what RSes should do about them in EPUB 3.3 RS. In particular, external entities (both general entities and parameter entities) are disallowed by EPUB 3.3. I thus do not think EPUB 3.3 RS has to say something about them. EPUB 3.3 already allows RSes to reject them and epubcheck to report errors. (Note: This is another example of data conformance dominating application conformance.)

Here is my proposed rewrite of 3.3 in EPUB 3.3 RS.

A reading system MUST use a non-validating XML processor that conforms to [xml] and [xml-names].

murata2makoto avatar Sep 29 '22 15:09 murata2makoto

The issue was discussed in a meeting on 2022-09-29

List of resolutions:

  • Resolution No. 1: Review RS specification for statements around RS behaviour for non-conformant EPUB publications, check if can they be removed safely.
View the transcript

1. Can a conforming RS reject EPUBs that reference external entities? (issue epub-specs#2433)

See github issue epub-specs#2433.

Brady Duga: I was running a test that had external entities. The spec says you must leave the entity as if it wasn't an external entity.
… we reject that EPUB with an error, and request that publisher fix it.
… would that make us non-conforming?.
… and does that mean that conforming RS are not allowed to reject EPUBs, even if they don't pass EPUB check?.
… some related discussion in the issue about what this means for MathML.

Murata Makoto: in general i'm against application conformance. I'd avoid describing required behaviour of RS. I'm for data conformance, i.e. requirements in EPUBs.
… only if we have very strong reasons to describe behaviours should we say something. In this case, I would not say anything about what the specific implementation should be.
… i.e. reject EPUB, crash, display error.

Brady Duga: I agree, and in some sense that feels like what we did in the past.
… especially now that we have a separate spec for RS.
… but we also don't want these EPUBs to work.
… say publisher creates a book with external entity, and some RS displays it "correctly" by showing the external entity.
… creates perception that supporting external entity is okay.

Murata Makoto: EPUB 3.3 Core already disallows external entities.

Brady Duga: people who make EPUBs don't necessarily read the spec.

Murata Makoto: there could be many such errors beyond our imagination.

Brady Duga: i worry that if we say nothing, supporting external entities could also be valid.

Shinya Takami (高見真也): if Core says these external entities are disallowed, then epubcheck should mark this as an error.

Murata Makoto: https://www.w3.org/TR/epub-33/#sec-xml-constraints.

Shinya Takami (高見真也): if there is no prohibition, then there should be a warning.

Wendy Reid: in the issue there is a log from epubcheck that shows an error being output in this case.
… I also agree that its weird when we get too specific when we say what RS should do when they run into specific scenarios.
… in this case though, we had not previously talked about validation or what happens between the publisher and the RS.
… the intermediary layer of epubcheck/ingestion pipeline.
… which often rejects content before it gets to the RS.

Brady Duga: the problem is that the rejection code is the RS, i.e. it is the thing that you submit the authored EPUB to.
… those requirements are on the RS.

Shinya Takami (高見真也): publishers sometimes generate mistakes in EPUB files. RS should ignore if they find external entities in EPUBs.

Brady Duga: from a publishing standpoint, would they rather see an error and a rejection, or allow that to potentially get through to end users.
… I feel they would rather flag the error/external entity right away.

Murata Makoto: a long time ago, there was discussion about draconian error handling in XML processing, i.e. if you should stop right away when you encounter error, or if you should try to recover.
… I feel EPUB should be a case of draconian error handling.
… in a11y world, there is confusion about who should be responsible for what. A similar case of failure of application conformance.

Shinya Takami (高見真也): I disagree. We should try to process EPUBs even if they have errors. Maybe just ignore the error by continue to process.

Brady Duga: i think it depends. When I get a publication from a publisher, I would like to reject right away and notify. When I get an EPUB from a user/sideload, I would like to continue to process as best as I can.
… can the spec be written in a way that makes both acceptable.
… without supporting external entities.

Wendy Reid: I did a test with that file on Thorium and Apple Books. Test passes on Thorium, but fails on Apple Books (both statements display).
… I think I agree that if we can prevent errors, then we should do so. But if someone happens to sideload a book that has external entities in it or whatever, we shouldn't stop them from opening the book.
… how would you explain to an end user that they need to go back to the publisher where they got the book?.

Murata Makoto: unlike in XML, in EPUB we have said nothing about "incorrect publications". We leave it open to RS to define what is incorrect to them..

Wendy Reid: i think we agree that we want to allow a conforming RS to reject something pre-processing, but also make it possible to continue to process.

Murata Makoto: in general, I think we should say nothing about what RS should do. This would ensure the hands of implementors are not tied..

Brady Duga: do we know how many of this type of requirement are in the RS spec?.

Wendy Reid: yeah, those tests about "RS must resolve package metadata", so probably several are in there.
… mgarrish would know.

Murata Makoto: Agreed.

Brady Duga: if there are few enough, should we review them all so that whatever solution we implement is consistent across them all?.

Proposed resolution: Review RS specification for statements around RS behaviour for non-conformant EPUB publications, check if can they be removed safely. (Wendy Reid)

Brady Duga: +1.

Murata Makoto: +1.

Toshiaki Koike: +1.

Wendy Reid: +1.

Shinya Takami (高見真也): +1.

Matthew Chan: +1.

Masakazu Kitahara: +1.

Resolution #1: Review RS specification for statements around RS behaviour for non-conformant EPUB publications, check if can they be removed safely.

iherman avatar Sep 30 '22 08:09 iherman

To move ahead with this issue:

  1. @murata2makoto: reflecting on https://github.com/w3c/epub-specs/issues/2433#issuecomment-1262440034: I believe the current spec already says what you propose. The only issue is the remark in that section (quoted in the issue statement) about rejecting the resolution of external identifiers. We could change the MUST NOT statement to SHOULD NOT, but I do not think that solves the original issue, which is also present in other tests (see https://github.com/w3c/epub-specs/issues/2433#issuecomment-1248627153) that are not related to XML issues. I am therefore not sure that we should make this change.
  2. I believe it is perfectly fine, from a testing perspective, if a RS tester puts a null in the report for those tests which are not applicable to the RS, because it relies on its surrounding infrastructure that would filter out the offending EPUB publications. In other words, I think it is fine to say that those tests are reserved to those RS-s that allow for a public side loading of content, like Apple Books, Thorium, etc.
  3. I see, however, the point that an RS want to declare itself as conformant per spec, and I believe this issue should be mentioned in that section (e.g., in a note). My first, absolutely drafty, text I would consider putting there is something like

Some Reading Systems operate as part of larger publishing workflow that would filter out some non-conformant EPUB publications (e.g. SOME EXAMPLES HERE). As a consequence, implementers of these Reading Systems may decide not to add a strong check on such publications, although required in this specification. This decision does not affect the conformance of these Reading Systems.

I believe this approach would also be in line with the WG resolution.

cc @davemanhall @wareid @dauwhe @shiestyle

iherman avatar Oct 03 '22 07:10 iherman

To move ahead with this issue:

  1. @murata2makoto: reflecting on Can a conforming RS reject EPUBs that reference external entities? #2433 (comment): I believe the current spec already says what you propose.

No, it does not. The current wording confuses reading systems and XML processors. XML processors are different programs invoked by reading systems. RSes may well reject the output of successful parsing by XML processors or even refuse to invoke XML processors if there are good reasons. The current wording does not allow such possibilities, since it confuses RSes and XML processors.

Moreover, I think "a reading system MUST NOT resolve external identifiers in DOCTYPE, ENTITY and NOTATION declarations" is just harmful and has no advantages.

murata2makoto avatar Oct 03 '22 12:10 murata2makoto

To move ahead with this issue:

  1. @murata2makoto: reflecting on Can a conforming RS reject EPUBs that reference external entities? #2433 (comment): I believe the current spec already says what you propose.

No, it does not. The current wording confuses reading systems and XML processors. XML processors are different programs invoked by reading systems. RSes may well reject the output of successful parsing by XML processors or even refuse to invoke XML processors if there are good reasons. The current wording does not allow such possibilities, since it confuses RSes and XML processors.

Hm. This sounds like confusingly meticulous to me. Actually, if we really want to be precise, a reading system (I presume) probably does not "invoke an XML processor"; it, rather, uses an XML parser and processor module/library built into its own application. But, for all purposes, this is an application detail: as far as XML processing goes, the RS acts as an XML processor v.a.v. the EPUB content. I.e., I do not see any problem with the current formulation, to be honest.

Moreover, I think "a reading system MUST NOT resolve external identifiers in DOCTYPE, ENTITY and NOTATION declarations" is just harmful and has no advantages.

I do not see why this is harmful. If the RS was to resolve those external files, it could not be used off-line. It is a proper statement to RS-s not to attempt doing this: it does not bring any advantage and it may be harmful for the user experience. I can live with a SHOULD NOT instead of MUST NOT, but I believe the overall direction of that sentence is fine.

iherman avatar Oct 03 '22 15:10 iherman

as far as XML processing goes, the RS acts as an XML processor v.a.v. the EPUB content.

Here we disagree. I think that an RS can certainly refuse what an XML processor emits. For example, there is nothing wrong even if we specify that an RS MUST reject every XML document containing a DTD. Meanwhile, an XML processor MUST NOT reject such XML documents.

Moreover, I think "a reading system MUST NOT resolve external identifiers in DOCTYPE, ENTITY and NOTATION declarations" is just harmful and has no advantages.

I do not see why this is harmful.

I have a number of concerns. One is data conformance .vs. application conformance. (I would like to have online discussion about this.) Another is the confusion between RSes and XML processors. Yet another is that this sentence is encroaching on the territory of the XML recommendation. Why should EPUB define conformance conditions of XML processors? Use XML as is.

murata2makoto avatar Oct 03 '22 15:10 murata2makoto

as far as XML processing goes, the RS acts as an XML processor v.a.v. the EPUB content.

Here we disagree. I think that an RS can certainly refuse what an XML processor emits. For example, there is nothing wrong even if we specify that an RS MUST reject every XML document containing a DTD. Meanwhile, an XML processor MUST NOT reject such XML documents.

Moreover, I think "a reading system MUST NOT resolve external identifiers in DOCTYPE, ENTITY and NOTATION declarations" is just harmful and has no advantages.

I do not see why this is harmful.

I have a number of concerns. One is data conformance .vs. application conformance. (I would like to have online discussion about this.) Another is the confusion between RSes and XML processors. Yet another is that this sentence is encroaching on the territory of the XML recommendation. Why should EPUB define conformance conditions of XML processors? Use XML as is.

@murata2makoto I cannot say I fully understand what you say, but we indeed do seem to disagree on both points. But I believe the best at this stage is if you submit a PR offering an alternative to the current text on XML conformance. That would make the discussion more productive.

Two things to note, though:

  1. This seems to be a separate issue from what @bduga raised. His problem is how a conformant RS should behave, and can we declare an RS conformant even if the corresponding tests are not really applicable. The XML thing was just an example. Ie, you should raise a separate issue on XML conformance issues and bind a PR to it.
  2. We are in CR. If your proposal leads to a substantial change that would, potentially, invalidate current implementation reports, tests, etc, we may have to publish a CR snapshot and re-set the clock for IPR processes. This can all be done, of course, but we are late in the process, and we should process with care...

iherman avatar Oct 03 '22 16:10 iherman

Moreover, I think "a reading system MUST NOT resolve external identifiers in DOCTYPE, ENTITY and NOTATION declarations" is just harmful and has no advantages.

Isn't this standard practice for preventing XXE attacks, though? In this case, the reading system has to disable this processing when invoking the xml processor. Even if we want to say that the reading system is using an xml processor rather than being one, shouldn't we still have some statement about the security risks? (It could arguably go in the security section, of course.)

mattgarrish avatar Oct 03 '22 16:10 mattgarrish

@mattgarrish

XML has several security risks. Resolving external entities is one of them, but internal entities can be used to cause more serious problems. Such security risks are well described in RFC 7303. XXE attacks can be prevented in many ways. XML catalogs or something similar can be used for this purpose.

I do not believe that conformance requirements are the right place for addressing such security risks. Notes or warnings sound more appropriate.

murata2makoto avatar Oct 04 '22 01:10 murata2makoto

Resolving external entities is one of them, but internal entities can be used to cause more serious problems.

Yes, but we don't restrict internal entities in the authoring spec. The external restriction was added, as I recall, to match up with the restriction against external identifiers in authoring. It's still good to be able to note why that restriction is in place for RS developers.

I do not believe that conformance requirements are the right place for addressing such security risks.

Agree. That's why I'm fine moving this to the security section. We can add a new subsection there for XML risks.

But as @iherman has already mentioned, we should move this to another issue as it's not resolving the question that was asked.

mattgarrish avatar Oct 04 '22 11:10 mattgarrish

@murata2makoto thanks for raising #2447. But there seem to be another issue still pending (besides the one that has triggered this thread): do you still want to change the text in the current spec? If so, to what?

iherman avatar Oct 05 '22 12:10 iherman

But there seem to be another issue still pending (besides the one that has triggered this thread): do you still want to change the text in the current spec? If so, to what?

I will create another pull request.

murata2makoto avatar Oct 05 '22 13:10 murata2makoto

The issue was discussed in a meeting on 2022-10-07

  • no resolutions were taken
View the transcript

2. XML security risk: expansion of internal parsed entities (issue epub-specs#2447)

See github issue epub-specs#2447.

See github issue epub-specs#2433.

Murata Makoto: old issue; more than 20 years ago.
… related to external parsed entities issue 2433.
… this issue is about INTERNAL parsed entities.
… doesn't use external identifiers or URLs.
… these are defined in internal DTD subsets.

Dave Cramer: See description of the attack.

Murata Makoto: if an internal reference references a different internal reference, things can get out of hand quickly.
… some EPUB reading systems ignore rec of XML and ignore internal parsed entities. This is non-conformant but avoids security issue.
… what should we do? Prohibit internal DTD subsets?.
… which gets rid of internal parsed entities..
… can exissting RSes be destryeed by malicious content?.

See github pull request epub-specs#2451.

Ivan Herman: thanks Makoto for raising the issue.
… the problem is that removing the possibility to use internal entities is something we can't do because of existing epub docs that might use this.
… we cannot invalidate existing content per our charter.
… as part of a pull request that makoto and I did together is a separate section.
… in the security section of RS spec.

Ivan Herman: See proposed text in the PR's security section.

Ivan Herman: which say that RS should be aware of this problem and should deal with it..
… so we should draw attention to the issue.

Dave Cramer: I don't think we can or should forbid internal entities.
… the problem has been around 20 years and EPUB is still functional.
… it's a security vulnerability, we may need to warn people in our security section.
… ordering RSs to follow a specific algorithm might be overkill.
… let's warn people.

Murata Makoto: Basically, not our problem.

Brady Duga: this is a known problem in XML. Are there known solutions?.
… do they have mitigations? Like in libxml?.

Dave Cramer: "Defenses against this kind of attack include capping the memory allocated in an individual parser if loss of the document is acceptable, or treating entities symbolically and expanding them lazily only when (and to the extent) their content is to be used.".

Ivan Herman: makoto may know about libraries.
… makoto made some sample epubs, and we tested them.
… I tested on Thorium and Apple Books.
… both reacted by rejecting the EPUB.
… without identifying the problem.
… we suspect they don't accept internal entities.

Murata Makoto: I'm not aware of general solutions.
… XML WG was aware of this issue, but had no good solutions.
… thorium doesn't reject publication, it ignores the entity but not the entire publication.
… this behavior is non-conformant.

Dave Cramer: I don't think it's entirely fair to say that this implementation is non-conformant because it's not deploying my vulnerability.
… not fair to RS.

Wendy Reid: it's like the viewport discussion.
… we want to avoid putting burdens on readers.
… does the fail quietly solution work?.
… is that a bad thing?.
… I think because we are running out of time.
… people should look at the PR.
… there are other things in the PR.
… then we move on.

Dave Cramer: I think we add something to the security section.
… it hasn't been a problem in the past, internal entities aren't usually a problem unless malicious.
… what are the consequences of a malicious use.
… the consequences seem small.
… good enough to say "XML has a known vulnerability:.

Brady Duga: I agree putting this in the security section, maybe mentioning other xml vulnerabilities.
… any attack like this can jeopardize personal information.
… although no one has bothered to do this.

Dave Cramer: Some of this is also outside the scope of the WG, we can't ask for people to patch the OS they are running dev machines on.

Wendy Reid: consensus: let's look people look at Ivan's PR.
… comment on it.
… ten minutes remaining.
… two more topics.

iherman avatar Oct 07 '22 15:10 iherman