epubcheck icon indicating copy to clipboard operation
epubcheck copied to clipboard

SVG DTD not accepted

Open iherman opened this issue 4 years ago • 32 comments

(This seems to be related to https://github.com/w3c/epubcheck/issues/173#issuecomment-27623760, although not to rdf:RDF, which is the original topic of that issue.)

The W3C SVG Logo is not accepted by epubcheck, complaining about "External DTD entities are not allowed.". That is problematic.

The SVG file itself does not include any external DTD entity. However, the Logo file does contain:

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">

Indeed, the DTD file itself contains loads of entities. It is also correct that the latest version of SVG (SVG2) has removed the reference to a DTD (see change section). But flagging this is as an error does create backward incompatibility issues with bona fide SVG1.1 content which did rely on the DTD (e.g., all examples in SVG1.1 include that DTD reference, but also all SVG content produced by earlier versions of Adobe Illustrator, to take just this example). Note that all browsers accept such SVG 1.1 content without further ado.

I would think that epubcheck should simply ignore the SVG DTD and shouldn't report any error or warning on that one.

iherman avatar Mar 16 '20 09:03 iherman

B.t.w., formally, the latest Recommendation is still SVG 1.1. SVG2 is a Candidate Recommendation.

iherman avatar Mar 16 '20 09:03 iherman

I guess this issue should be discussed in the EPUB CG?

EPUBCheck conforms to the EPUB spec here. I personally agree the rule is probably too strict, especially for SVG used as images and not as content docs. But any deviation from the spec should be approved by the CG 😊

rdeltour avatar Mar 16 '20 09:03 rdeltour

This would have to be a change to the core spec, as external references are strictly forbidden from DTDs per https://www.w3.org/publishing/epub3/epub-spec.html#confreq-xml-extmarkupdecl

It was put in at least in part for security reasons, as I recall, but perhaps there's a case to allow specific external references outside of HTML files. Definitely needs discussion.

mattgarrish avatar Mar 16 '20 10:03 mattgarrish

A similar issue may actually come up with MathML, if the DTD is used: https://www.w3.org/Math/DTD/mathml3/mathml3.dtd has similar constructions.

I would think that the spec (and check) should relax on the standard, official DTD-s for the accepted content types. I am not sure why a reading system would even look at those DTDs in the first place.

iherman avatar Mar 17 '20 05:03 iherman

Have any publishers complained? If no publishers complain, I do not want to fix EPUB3.

murata2makoto avatar Mar 17 '20 07:03 murata2makoto

I am not a publisher:-), but I stumbled on this issue through a script that turns W3C recommendations into EPUB.

I realize that is a small and not-very-important example. However, we see (at last!) a number of tools producing EPUB files from, e.g., popular editors (Apple Pages, Google Doc, LibreOffice, soon MS Word). Which is of course great, it makes EPUB becoming some sort of commodity; production of EPUB is not, and should not, be done only by bona fide publishers. However, alongside this evolution, we will also see, for example, more SVG content coming into EPUB from tools like Adobe Illustrator. (The specific SVG file that hit me was an Illustrator output.) That means such issues will come.

B.t.w., such a change can be done in a very mild way, without affecting any deployed content. The essence of the restriction may remain in place; the only exception would be to ignore the official DTD-s of contents that are listed in the EPUB spec (MathML, SVG, SMIL, ...). I do not see any harm doing that.

iherman avatar Mar 17 '20 08:03 iherman

It was put in at least in part for security reasons, as I recall, but perhaps there's a case to allow specific external references outside of HTML files. Definitely needs discussion.

Is this also a matter of making EPUBs self-contained? If you need a network request to parse an XML file, how would the EPUB work in the proverbial train tunnel?

The security implications seem non-trivial:

https://en.wikipedia.org/wiki/XML_external_entity_attack https://en.wikipedia.org/wiki/Billion_laughs_attack https://insinuator.net/2015/03/xxe-injection-in-apache-batik-library-cve-2015-0250/

dauwhe avatar Apr 28 '20 19:04 dauwhe

The security implications seem non-trivial:

Ya, but the solution in epub tends to be. Reading systems should be the ones not resolving external entity references. There's not much value in restricting authoring of such references, as if they're being used maliciously are you really running them through epubcheck?

mattgarrish avatar Apr 28 '20 20:04 mattgarrish

It was put in at least in part for security reasons, as I recall, but perhaps there's a case to allow specific external references outside of HTML files. Definitely needs discussion.

Is this also a matter of making EPUBs self-contained? If you need a network request to parse an XML file, how would the EPUB work in the proverbial train tunnel?

I think maintaining the restriction may be a good idea. However, there may be a set of well known, and standard DTD, whose parsing is probably unnecessary anyway, ie, that serve only some sort of identification. That should include the DTD-s for the content documents that are allowed in EPUB in the first place, ie, SVG, MathML, or SMIL (I did not check whether all of them depend on entities).

iherman avatar Apr 29 '20 06:04 iherman

Is this also a matter of making EPUBs self-contained? If you need a network request to parse an XML file, how would the EPUB work in the proverbial train tunnel?

In EPUBCheck, we use a URL resolver and local copies of DTDs and other references. I guess a similar approach would be taken by a reading system to prevent unnecessary network requests.

As for security implications, they're real indeed, but again this is the kind of things that are better dealt with at the software level than at the spec level, IMO. The one benefit of having such restrictions in the spec is that it becomes harder for malicious content to enter curated distribution channels which run EPUBCheck as a gatekeeper, but it doesn't mean RS wouldn't have safeguards anyways (for non-curated content).

For instance, a few years ago, EPUBCheck was reported a vulnerability to XML external entity attack (CVE-2016-9487), which we fixed in version 4.0.2. The spec didn't prevent this happening.

rdeltour avatar Apr 29 '20 07:04 rdeltour

This is a very old and annoying issue to me. The original XML people tried to solve it, but it still bites us.

Allowing external DTD subsets might open a can of worms, but we can probably avoid them.

First, the standard DTD for SVG 1.1 does not define parsed entities (unless people customize the DTD). Thus, XML processors that do not examine the external DTD subset have no problems in parsing SVG documents.

Second, this is not true for MathML. A number of parsed entities (such as &harr; for U+2194 (LEFT RIGHT ARROW)) are defined or borrowed by the MathML DTD. But the restriction in EPUB 3.2 prohibits the use of such parsed entities in EPUB, unless people define parsed entities in the internal DTD subset So, we can safely assume that people use Unicode characters rather than parsed entities of MathML. Is this true? > @TzviyaSiegman

Here is my proposal.

  1. Lift the current restriction for SVG and MathML.
  2. Prohibits the use of references to entities with the exception of those parsed entities (such as &lt;) defined in the XML specification.

P.S. Some XML geeks might use the internal DTD subset for defining parsed entities. We might want to require a declaration in the internal DTD subset for each of the parsed entities used in the XML document (unless they are defined in the XML recommendation.)

murata2makoto avatar Apr 29 '20 09:04 murata2makoto

@murata2makoto

So, we can safely assume that people do not use parsed entities of MathML and that they use Unicode characters rather than such entities. Is this true?

I would certainly hope so... MathML 3, which is the latest recommendation of MathML, does not require the use of DTD-s (just as SVG1.2 will also do away with a required DTD). As you said, the specific mathematical characters can be taken care of by their Unicode equivalents these days.

(What I do not know, however, is what various converters, like LaTeX->MathML, do. I would think that most of the authors provide their equations in LaTeX, which is then converted into MathML. Some of those tools may be older and rely on DTD-s, just like the SVG files generated by Adobe Illustrator...)

iherman avatar Apr 29 '20 10:04 iherman

Lift the current restriction for SVG and MathML.

SMIL also uses DTD-s with entities. Do we allow explicit SMIL files in an EPUB?

As far as I can see SSML does not require DTD-s :-)

iherman avatar Apr 29 '20 10:04 iherman

SMIL also uses DTD-s with entities. Do we allow explicit SMIL files in an EPUB?

Media Overlay uses SMIL documents. But MO defines a very small subset of SMIL, which does not contain entities.

murata2makoto avatar Apr 29 '20 11:04 murata2makoto

SMIL also uses DTD-s with entities. Do we allow explicit SMIL files in an EPUB?

Media Overlay uses SMIL documents. But MO defines a very small subset of SMIL, which does not contain entities.

But the DTD does... Ie, even if a small subset is used but the specific SMIL file includes the DTD then we have a problem. It is exactly the same problem as with mine with SVG which started this thread :-(

iherman avatar Apr 29 '20 11:04 iherman

@murata2makoto

Here is my proposal.

  1. Lift the current restriction for SVG and MathML.
  2. Prohibits the use of references to entities with the exception of those parsed entities (such as <) defined in the XML specification.

We could also adopt an approach similar to the HTML Standard’s XML parsing rules:

  1. define a set of well-known public identifiers, and specify exactly what URL they correspond to (for all the relevant specs we could see if it makes sense to use the one specified in the HTML spec, containing the list of well-known named character references).
  2. say that user agents should not attempt to retrieve external entity's content.

rdeltour avatar Apr 29 '20 12:04 rdeltour

@iherman

It is exactly the same problem as with mine with SVG which started this thread :-(

But this problem is theoretical, since nobody would like to create SMIL documents containing document type declarations. I also think that references to parsed entities are not needed in MO documents.

@rdeltour

define a set of well-known public identifiers

Makes sense.

We could also adopt an approach similar to the HTML Standard’s XML parsing rules:

Are you proposing to allow references to entities defined or borrowed by the MathML DTD? If we allow them, XML processors that do not retrieve external DTD subsets cannot handle them.

say that user agents should not attempt to retrieve external entity's content

I also think that XML processors that retrieve external DTD subsets are not very useful for EPUB. But I am not sure if we have to say "should not attempt to retrieve" them, since somebody might want to validate SVG or MathML using DTDs.

murata2makoto avatar Apr 29 '20 16:04 murata2makoto

Should this issue be addressed by publishing errata to EPUB 3.2 or by the next version of EPUB 3 (by the upcoming WG)?

murata2makoto avatar May 01 '20 06:05 murata2makoto

Should this issue be addressed by publishing errata to EPUB 3.2 or by the next version of EPUB 3 (by the upcoming WG)?

If there is an erratum for EPUB 3.2, it ought to be taken up by the upcoming WG anyway. I do not think there is an urgency...

iherman avatar May 01 '20 08:05 iherman

I've discovered this old issue yesterday and may be missing something, but:

for instance for excluding this:

<!DOCTYPE html [ 
<!ENTITY new-release SYSTEM "https://bookseller.com/new-release.html"> 
]> 
<html> 
<body>
&news-release; 
</body>
</html> 
  • in the SVG case presented in this thread, the DTD is referenced in a standard manner; the content document does not present any external entity. The fact that the DTD itself presents external entities (a manner of modularizing DTDs) does not break the EPUB XML conformance constraint IMO.

  • Note that if the XML parser does not validate the SVG structure (and EPUB conformant processors don't need to validate EPUB content), the parser will not even encounter these external entities (nested in the DTD). Only EPUBcheck validates EPUB (+ SVG + MathML) content, for the sake of guarantying a better interoperability.

In conclusion, I don't see the point of this issue.

llemeurfr avatar Oct 10 '20 08:10 llemeurfr

  • The fact that the DTD itself presents external entities (a manner of modularizing DTDs) does not break the EPUB XML conformance constraint IMO.

This statement is, however, a bit ambiguous:

External identifiers MUST NOT appear in the document type declaration [XML].

Two ways of reading this is that the

  1. The SVG (ie, XML) Document Type Declaration MUST NOT contain external entities. I guess this is the current interpretation of epubcheck, yielding error messages. Or
  2. The SVG (ie, XML) source MUST NOT contain external entities, like your example above.

At the minimum, this ambiguity must be clarified in the spec if indeed only the second item is meant. That should then probably be followed up in the EPUB Issue.

In conclusion, I don't see the point of this issue.

This is the issue on epubcheck and not on the spec. There is a practical problem, as described in the original description of this issue. This issue is very much relevant until epubcheck is not changed to accept bona fide SVG files as valid.

If your interpretation is correct, then maybe the EPUB Issue can be closed although I think that, at the minimum, a clarification in the spec is necessary.


(We should stick to one issue to discuss all this, it is not helpful to switch between two different repositories...)

iherman avatar Oct 10 '20 10:10 iherman

@iherman

You are not the only person confused by the definition of DTDs.

A DTD is a pair of an external DTD subset and an internal DTD subset. Laurent's example document contains a DTD, which has an internal DTD subset only.

No external entities can occur in XML documents unless they are defined by DTDs.

murata2makoto avatar Oct 10 '20 10:10 murata2makoto

@iherman

You are not the only person confused by the definition of DTDs.

A DTD is a pair of an external DTD subset and an internal DTD subset. Laurent's example document contains a DTD, which has an internal DTD subset only.

No external entities can occur in XML documents unless they are defined by DTDs.

@murata2makoto I certainly yield to your XML knowledge. But then

  1. I believe a note or explanation in the EPUB spec is warranted to avoid such misunderstandings because implementations may go wrong (this is something for @mattgarrish I presume)
  2. Epubcheck must be updated to avoid this problem

iherman avatar Oct 10 '20 10:10 iherman

@iherman

I agree with your first point. Do not understand your second.

murata2makoto avatar Oct 10 '20 10:10 murata2makoto

@iherman

I agree with your first point. Do not understand your second.

If your interpretation of the EPUB Spec text is correct, then there is a bug in epubckeck. It should not flag an SVG DTD as an invalid content; it does today.

iherman avatar Oct 10 '20 12:10 iherman

We could solve the issue by looking at it from another point of view: XML external entities can either be used by an XML instance (as in my previous sample) or by the DTD itself, as in the modularized SVG 1.1 DTD.

The first case is harmful, the second is not (the reason being that reading systems don't validate XML instances before display).

We solve the issue by re-phrasing the constraint "External identifiers MUST NOT appear in the document type declaration" as (the context being the Publication Resource) "It MUST NOT make use of XML external entities defined in a document type declaration".

EPUBCheck will have to be modified to accept the presence of external entities in DTD and block only if such entites are used in XML instances.

llemeurfr avatar Oct 12 '20 10:10 llemeurfr

(Admin comment)

@llemeurfr linking this to the ongoing discussion on https://github.com/w3c/publ-epub-revision/issues/1338, in particular https://github.com/w3c/publ-epub-revision/issues/1338#issuecomment-707000732...

We should really continue this discussion on one thread only...

iherman avatar Oct 12 '20 10:10 iherman

Just a quick clarification, as I read the discussion of the past few days:

The EPUB 3.2 XML Conformance states that for any XML publication resource:

External identifiers MUST NOT appear in the document type declaration.

This unambiguously says that the SVG doctype declaration given by @iherman as an example is disallowed, which EPUBCheck reports (correctly, for now).

I don't see what's to be "interpreted" here, or what's the bug in EPUBCheck. Am I missing something?

Note that I'm not saying the spec is good there. The spec can be made less restrictive (I personally believe it should), but then again, that's for the WG to decide. I'm marking that issue as blocked, as it depends on updating the spec.

rdeltour avatar Oct 14 '20 14:10 rdeltour

This unambiguously says that the SVG doctype declaration given by @iherman as an example is disallowed, which EPUBCheck reports (correctly, for now).

Well, looking at https://github.com/w3c/epubcheck/issues/1114#issuecomment-706508997, https://github.com/w3c/epubcheck/issues/1114#issuecomment-706526867, https://github.com/w3c/epubcheck/issues/1114#issuecomment-706527683 it seems that things are not that clear.

iherman avatar Oct 14 '20 14:10 iherman

@iherman

it seems that things are not that clear.

OK, so let's put this differently 😊:

  • I assume we all agree that <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> is a document type declaration, correct?
  • in this declaration, PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" is an external identifier , composed of a public identifier ("-//W3C//DTD SVG 1.1//EN") and a system identifier ("http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"). Together, the public identifier and system identifier point to the external subset of the DTD.
  • then, the doctype is invalid in EPUB, as an external identifier literally appears in the document type declaration [edited for clarification].

Is it the second bullet above that you find ambiguous? 🤔

All I'm saying is that the EPUB conformance criteria doesn’t even goes into looking at the DTD content. It only says, "the doctype must not have a public identifier nor a system identifier".

Regarding the comments you reference:

Well, looking at #1114 (comment), #1114 (comment), #1114 (comment) it seems that things are not that clear.

  • @llemeurfr’s comment is about the spec intention ("I understand (…) as way to ensure that the content document is self-contained"), it doesn’t really say the spec as it is today is confusing.
  • @murata2makoto’ comment says that DTDs are generally confusing, again not the EPUB spec statement itself.
  • finally, your comment is where the confusion is, if I understand correctly.

You say (slightly reorganizing your comment):

Two ways of reading "External identifiers MUST NOT appear in the document type declaration" are

  • the Document Type Declaration MUST NOT contain external entities
  • The XML content MUST NOT contain external entities

The first one is true, as a corollary to the original statement. But really, the original statement suffice in itself to reject the SVG 1.1 doctype, with no need to a special "way of reading" or "interpretation" 😊

rdeltour avatar Oct 14 '20 19:10 rdeltour