URL normalization of OCF resources

Open rdeltour opened this issue 4 months ago • 8 comments

Since #1628 EPUBCheck applies Unicode normalization of URLs to check if they match an existing resource in the OCF container.

This change was motivated by issue #1606, where the user was confused by EPUBCheck reporting errors for URLs that did not use the same code points as the file name to represent an umlaut diacritic.

For example:

  • given a file named schön.xhtml in the OCF,
  • and its name is coded with the precomposed character ö (U+00F6)
  • but its name is not coded as an o (U+006F) combined with the diaeresis ◌̈ (U+0308)
  • then the URL sch%C3%B6n.xhtml did match the file name
  • but the URL scho%CC%88n.xhtml did not match the file name
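
To make the example concrete, here is a minimal Python sketch (an illustration only, not taken from EPUBCheck) that percent-encodes the precomposed and decomposed forms of the file name with the standard library. The variable names are made up for this example.

```python
import unicodedata
from urllib.parse import quote

# File name as stored in the OCF, with precomposed ö (U+00F6).
name = "sch\u00f6n.xhtml"

nfc = unicodedata.normalize("NFC", name)  # keeps ö as a single code point
nfd = unicodedata.normalize("NFD", name)  # splits ö into o (U+006F) + U+0308

print(quote(nfc))  # sch%C3%B6n.xhtml
print(quote(nfd))  # scho%CC%88n.xhtml
print(nfc == nfd)  # False: the code point sequences differ
```

Both strings render identically, but they are different code point sequences and therefore produce different percent-encoded URLs.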

In theory, the specification currently says the following about file URLs (warning: the spec links below are to a dated spec version):

  1. the content URL of a file in the OCF is determined from the file's path (section URLs in the OCF abstract container)
  2. the file path is defined as determined by the algorithm Deriving file paths
  3. in turn, the "derive the file path" algorithm constructs the path with the file name
  4. file names are defined in OCF file paths and file names as scalar value strings, i.e. a sequence of code points.

Concretely, that means that according to a strict interpretation of the specification, in the example above, sch%C3%B6n.xhtml is a valid URL for the given file, but scho%CC%88n.xhtml is not.

But that interpretation may be too strict, and tests show that reading systems (and browsers) apply some URL normalization and will render the file correctly regardless of the decomposition form used in the URL. That's why I decided to accept that in EPUBCheck, at the cost of diverging from a strict interpretation of the spec.
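
The difference between the strict reading and the lenient behavior can be sketched in Python as follows. This is not EPUBCheck's actual code (EPUBCheck is written in Java); the function names `strict_match` and `lenient_match` are hypothetical, and the choice of NFC as the common form is an assumption made for the illustration.

```python
import unicodedata
from urllib.parse import unquote


def strict_match(url_path: str, file_name: str) -> bool:
    """Percent-decode, then compare code point for code point."""
    return unquote(url_path) == file_name


def lenient_match(url_path: str, file_name: str) -> bool:
    """Percent-decode, then compare after normalizing both sides to NFC."""
    decoded = unicodedata.normalize("NFC", unquote(url_path))
    return decoded == unicodedata.normalize("NFC", file_name)


file_name = "sch\u00f6n.xhtml"  # stored with precomposed ö (U+00F6)

print(strict_match("sch%C3%B6n.xhtml", file_name))    # True
print(strict_match("scho%CC%88n.xhtml", file_name))   # False
print(lenient_match("scho%CC%88n.xhtml", file_name))  # True
```

Under the strict comparison only the precomposed URL matches; with normalization applied to both sides, either form resolves to the same file.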

So, the questions are:

  • is it acceptable that EPUBCheck does not report an error for a mismatch in the URL and file name encoding of composite characters?
  • shouldn't the spec be explicit about allowing URLs in both forms? like, explicitly talk about doing some kind of URL normalization when checking matching URL to OCF content? If yes, we should open a new spec issue.

@mattgarrish @iherman, thoughts welcome!

rdeltour avatar Sep 09 '25 09:09 rdeltour

@rdeltour,

  • shouldn't the spec be explicit about allowing URLs in both forms? like, explicitly talk about doing some kind of URL normalization when checking matching URL to OCF content? If yes, we should open a new spec issue.

I believe that should be considered as a bug in the spec; there should not be any ambiguity in this respect. Specifically, I believe that both forms (precomposed and combined) should be accepted as equals for our purpose (i.e., I am fully in favor of what epubcheck is doing). Indeed, the average end-user has no control over these things, and it may lead to difficult bugs otherwise (e.g., I know I can use accented characters on the file names on my MacOS, but I have no idea which forms are used by the system nor how I could change that!).

I tried to chase down what change should be made to the spec to make this unambiguous, although I presume you know the whole parsing and comparison process better than anyone (I believe you authored those parts of the spec...). So far the only point I found in the spec that may require some update is in the file path and file names section. The current text says (towards the end of the section):

All file names within the same directory MUST be unique following Unicode canonical normalization [uax15] and then full case folding [unicode]. (Refer to Unicode Canonical Case Fold Normalization Step [charmod-norm] for more information.)

However, the reference to uax15 is ambiguous to me. Indeed, that document refers to several normalization forms and, in the EPUB spec, we are not specifying which form we use in the first place. More exactly, the text refers to the Unicode Canonical Case Fold Normalization Step but only in a "for more information" clause. Why don't we refer to it as a normative requirement and require that to be used? Wouldn't that specify exactly what epubcheck does, @rdeltour?

(I know there are W3C procedural issues insofar as charmod-norm is a WG Note. However, I would think we could argue that the points listed in the W3C normative references requirements would apply...)

iherman avatar Sep 09 '25 15:09 iherman

Why don't we refer to it as a normative requirement and require that to be used?

Here's the history on it: https://github.com/w3c/epub-specs/pull/1648

mattgarrish avatar Sep 09 '25 15:09 mattgarrish

Why don't we refer to it as a normative requirement and require that to be used?

Here's the history on it: w3c/epub-specs#1648

Sigh. Well, would we have to reopen this issue with this epubcheck problem?

iherman avatar Sep 09 '25 15:09 iherman

I'd open a new one but cite this issue and that pull request. We'll need the i18n folks to look at whatever we do and they've probably forgotten those old discussions.

mattgarrish avatar Sep 09 '25 16:09 mattgarrish

@iherman

So far the only point I found in the spec that may require some update is in the file path and file names section. The current text says (towards the end of the section):

All file names within the same directory MUST be unique following Unicode canonical normalization [uax15] and then full case folding [unicode]. (Refer to Unicode Canonical Case Fold Normalization Step [charmod-norm] for more information.)

That spec requirement is related, but does not directly impact our current issue. Basically, that requirement is there to prevent the existence of two files in the OCF that would differ only in case or character composition. EPUBCheck has checked this for a while.
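
A minimal Python sketch of such a uniqueness check might look like this. It is not EPUBCheck's implementation; `fold_key` and `find_collisions` are hypothetical names, and the NFD, case fold, NFD sequence is an assumption about how the charmod-norm step is defined.

```python
import unicodedata
from collections import defaultdict


def fold_key(name: str) -> str:
    """Canonical normalization (NFD), full case folding, then NFD again."""
    nfd = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", nfd.casefold())


def find_collisions(names: list[str]) -> list[list[str]]:
    """Group names in one directory that share the same folded key."""
    groups = defaultdict(list)
    for name in names:
        groups[fold_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = [
    "sch\u00f6n.xhtml",   # precomposed ö
    "scho\u0308n.xhtml",  # o + combining diaeresis
    "Sch\u00f6n.xhtml",   # differs only in case
    "other.xhtml",
]
collisions = find_collisions(names)
print(len(collisions), len(collisions[0]))  # 1 3
```

The first three names collapse to the same key, so a checker built this way would flag them as duplicates within the directory.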

The issue here is: given a file, how do we tell when a URL references this file? For example, for the file schön.xhtml in the OCF, do both sch%C3%B6n.xhtml and scho%CC%88n.xhtml reference the file, or only the first?

I'm not highly confident we can easily specify this in EPUB, as the concept of URL equivalence is quite complex. RFC3986 had a whole section on URL normalization and comparison. The URL standard dropped this. There are open issues related to this, notably a proposal to add a normalization API. But equivalence is also largely protocol dependent. For instance, RFC9110 about HTTP semantics has a section on normalization and comparison referring to the RFC3986 section mentioned above.

Anyways, any effort to further specify this would include research of how it's handled on the Web, proper testing of existing RS, etc. Good times ahead!

rdeltour avatar Sep 10 '25 10:09 rdeltour

As a little test, consider the following URLs

http URLs:

  1. https://github.com/w3c/epubcheck/blob/main/src/test/resources/epub3/04-ocf/files/ocf-container-filename-character-composition-valid/EPUB/content_ü_001.xhtml
  2. https://github.com/w3c/epubcheck/blob/main/src/test/resources/epub3/04-ocf/files/ocf-container-filename-character-composition-valid/EPUB/content_%C3%BC_001.xhtml
  3. https://github.com/w3c/epubcheck/blob/main/src/test/resources/epub3/04-ocf/files/ocf-container-filename-character-composition-valid/EPUB/content_u%CC%88_001.xhtml

file URLs:

  1. file:///epubcheck/epubcheck/src/test/resources/epub3/04-ocf/files/ocf-container-filename-character-composition-valid/EPUB/content_ü_001.xhtml
  2. file:///Users/romain/Work/epubcheck/epubcheck/src/test/resources/epub3/04-ocf/files/ocf-container-filename-character-composition-valid/EPUB/content_%C3%BC_001.xhtml
  3. file:///epubcheck/epubcheck/src/test/resources/epub3/04-ocf/files/ocf-container-filename-character-composition-valid/EPUB/content_u%CC%88_001.xhtml

A quick test on macOS (latest Sequoia, with latest browser versions) gives these results:

URL      Safari   Chrome   Firefox
http.1
http.2
http.3   no       yes      yes
file.1
file.2
file.3

So these are not entirely consistent: test URL "http.3" does not resolve on Safari but does on the other browsers.

rdeltour avatar Sep 10 '25 10:09 rdeltour

Hm. From an outsider's point of view, I would consider the behavior of Safari on http.3 as a bug. But I acknowledge I did not dive into all the details of the URL specifications.

Would it be a possibility to contact someone from Apple, like Tess (she is also the AC rep for Apple, so it is justified to do so)? Alternatively, we raise this as a TAG issue.

Nevertheless, even if it is not a full answer, isn't it true that https://github.com/w3c/epub-specs/pull/1648 should indeed be reopened insofar as making the reference to NFC normative in EPUB?

iherman avatar Sep 10 '25 14:09 iherman

Hm. From an outsider's point of view, I would consider the behavior of Safari on http.3 as a bug. But I acknowledge I did not dive into all the details of the URL specifications.

To be honest, I'm not sure I fully understand what happens. The parsing is not supposed to percent-decode the path. The server (in the example, GitHub.com) is responsible for matching the URL to the actual resource; that's when URL normalization would happen. But I do not

When checking on a console, URL.parse returns consistent results in Safari/Chrome/Firefox. I don't know at what point Safari differs from the other two.

Nevertheless, even if it is not a full answer, isn't it true that w3c/epub-specs#1648 should indeed be reopened insofar as making the reference to NFC normative in EPUB?

I don't think there is too much ambiguity in what the specification means. The spec talks about Unicode canonical normalization, so this strikes out NFKC and NFKD (which are about compatibility equivalence, not canonical equivalence). Given the scope is to check for uniqueness, performing either NFD or NFC gives the same result.
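
As a quick illustration of that point, the following Python sketch compares pairs of names under each of the four normalization forms. The helper `same_under` and the sample names are made up for this example.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def same_under(form: str, a: str, b: str) -> bool:
    """True if a and b are identical after normalizing to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


pairs = [
    # Canonically equivalent: precomposed vs decomposed ö
    ("sch\u00f6n.xhtml", "scho\u0308n.xhtml"),
    # Only compatibility equivalent: "fi" ligature (U+FB01) vs plain "fi"
    ("\ufb01le.xhtml", "file.xhtml"),
]

results = [{form: same_under(form, a, b) for form in FORMS} for a, b in pairs]
for row in results:
    print(row)
```

The first pair compares equal under all four forms. The second pair compares equal only under NFKC and NFKD, so NFC and NFD agree with each other on both pairs, while the compatibility forms would treat more names as duplicates.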

That said, I'll let you editors decide if it's worth reconsidering w3c/epub-specs#1648 to clarify that wording or normatively refer to charmod-norm.

rdeltour avatar Sep 15 '25 10:09 rdeltour