pwpub
What is the `origin` of a packaged publication?
A precise answer to this question should (probably) be included in the document. This origin affects the way relative URIs in the manifest are turned into absolute ones, it affects the behavior of scripts, etc.
The Readium document https://github.com/readium/architecture/blob/master/server/origin.md is also related to this issue, focusing on the problems Reading Systems are facing when setting the origin of content.
I had a discussion with @danielweck on this subject. Here is a summary, and I hope it will help those of us participating in this discussion.
Let's consider a Package; let's imagine that once exposed on the web (either statically after unpackaging or dynamically via a "publication server"), its manifest is served from https://domain.org:8080/pub_id/manifest.json:
The origin of the manifest is therefore https://domain.org:8080.
Once manifest.json is fetched by a user agent, this user agent will consider that the base URL for this resource is https://domain.org:8080/pub_id, and all relative URLs will be 'absolutized' using this value.
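To make the origin and the 'absolutization' step concrete, here is a minimal Python sketch using only the standard library. The relative references `chapter1.html` and `/cover.jpg` are hypothetical names chosen for illustration, and `urljoin` follows generic RFC 3986 resolution, which may differ in edge cases from what a given user agent does.

```python
from urllib.parse import urljoin, urlsplit

# Manifest URL taken from the example above.
MANIFEST_URL = "https://domain.org:8080/pub_id/manifest.json"


def origin_of(url: str) -> str:
    """Return scheme://host[:port] for a URL (simplified: no default-port handling)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def absolutize(base_url: str, relative: str) -> str:
    """Resolve a relative reference against the URL the manifest was fetched from."""
    return urljoin(base_url, relative)


print(origin_of(MANIFEST_URL))                    # https://domain.org:8080
print(absolutize(MANIFEST_URL, "chapter1.html"))  # https://domain.org:8080/pub_id/chapter1.html
print(absolutize(MANIFEST_URL, "/cover.jpg"))     # https://domain.org:8080/cover.jpg
```

Note that a reference starting with `/` escapes the publication's folder and resolves against the root of the origin.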
Optionally, in https://github.com/json-ld/json-ld.org/issues/604, it seems that the JSON-LD WG has agreed that a @base property can override the default base URL inside the JSON structure, which mimics what exists with the <base> element in HTML documents. But let's keep that on the side for now.
For sure, defining a base URL is not always simple: if the manifest is served from https://domain.org:8080/pub?pub_id=value, the base URL will be https://domain.org:8080/pub for all publications fetched from this server. Resolution of base URLs can be surprising, as shown in this playground written by Daniel.
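A small Python sketch of the kind of surprise meant here, under the assumption that resolution follows RFC 3986 as implemented by `urljoin`. The publication identifiers `a` and `b` and the file name `chapter1.html` are hypothetical.

```python
from urllib.parse import urljoin


def resolve(manifest_url: str, relative: str) -> str:
    """Resolve a relative reference found in a manifest fetched from manifest_url."""
    return urljoin(manifest_url, relative)


# Two different publications served through the same query-string endpoint.
pub_a = "https://domain.org:8080/pub?pub_id=a"
pub_b = "https://domain.org:8080/pub?pub_id=b"

# The query string is dropped and the last path segment is replaced,
# so both publications resolve "chapter1.html" to the very same URL.
print(resolve(pub_a, "chapter1.html"))  # https://domain.org:8080/chapter1.html
print(resolve(pub_b, "chapter1.html"))  # https://domain.org:8080/chapter1.html

# A trailing slash on the base also changes the outcome.
print(resolve("https://domain.org:8080/pub_id", "chapter1.html"))   # https://domain.org:8080/chapter1.html
print(resolve("https://domain.org:8080/pub_id/", "chapter1.html"))  # https://domain.org:8080/pub_id/chapter1.html
```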
But in practice, what affects the processing of relative URIs in the manifest is the base URL associated with the manifest; and this base URL, for any web resource, including JSON-LD, is defined by standard web practice: the Document base URL.
The case of a manifest embedded in the PEP was discussed in https://github.com/w3c/json-ld-syntax/issues/23. Maybe @iherman or @BigBlueHat can summarize the conclusion of this thread?
What if I, at publisher.org, created the package, and then sent it to you at retailer.com? The manifest would be served from retailer.com. If you consider retailer.com the origin of the publication, what's to stop the publisher from including malicious scripts that, for example, rewrite the DOM at retailer.com?
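The concern can be illustrated with a simplified same-origin comparison in Python. This is only a sketch of the (scheme, host, port) tuple comparison, not a model of a browser's full security policy, and the paths `/pubs/123/index.html` and `/account` are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_tuple(url: str):
    """Return the (scheme, host, port) tuple used for same-origin comparison."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(a: str, b: str) -> bool:
    return origin_tuple(a) == origin_tuple(b)


# Publication content unpacked and served by the retailer (hypothetical path).
publication_page = "https://retailer.com/pubs/123/index.html"
# A page belonging to the retailer's own site (hypothetical path).
store_page = "https://retailer.com/account"
# The same content, had it stayed on the publisher's domain.
publisher_page = "https://publisher.org/pubs/123/index.html"

print(same_origin(publication_page, store_page))      # True
print(same_origin(publisher_page, store_page))        # False
```

When the comparison yields `True`, the same-origin policy by itself does not separate the publisher's scripts from the retailer's pages.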
The case of a manifest embedded in the PEP was discussed in w3c/json-ld-syntax#23. Maybe @iherman or @BigBlueHat can summarize the conclusion of this thread?
I think the conclusion is what is in the current JSON-LD 1.1 draft:
When processing a JSON-LD script element, the Document Base URL of the containing HTML document, as defined in [HTML], is used to establish the default base IRI of the enclosed JSON-LD content.
The critical piece is the reference to the HTML spec which establishes the base URL for an HTML document.
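As a rough illustration of that rule, the following Python sketch derives a document base URL from a document's own URL and an optional `<base href>`. It is a simplification of the HTML definition (it ignores cases such as `about:srcdoc` documents), and the sample markup and URLs are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Record the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href is not None:
                self.base_href = href


def document_base_url(document_url: str, html_text: str) -> str:
    """Simplified document base URL: <base href> if present, else the document URL."""
    finder = BaseHrefFinder()
    finder.feed(html_text)
    if finder.base_href is None:
        return document_url
    return urljoin(document_url, finder.base_href)


PAGE_URL = "https://domain.org:8080/pub_id/index.html"
without_base = '<html><head><script type="application/ld+json">{"url": "c1.html"}</script></head></html>'
with_base = '<html><head><base href="/other/"><script type="application/ld+json">{"url": "c1.html"}</script></head></html>'

print(document_base_url(PAGE_URL, without_base))  # https://domain.org:8080/pub_id/index.html
print(document_base_url(PAGE_URL, with_base))     # https://domain.org:8080/other/
```

The value returned here is what the embedded JSON-LD would then take as its default base IRI.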
The question is whether what @llemeurfr and @danielweck describe above stands or not for the index.html file, too, i.e., whether this can be done so that the underlying HTML parser would be properly operational as well (and implementers would not have to create their own variant of an HTML parser).
@dauwhe, I wonder what a malicious publisher can do to hack the distributor's platform; could you detail what "rewrite the DOM" can be like and what can happen to the distributing platform?
@iherman IMO, the PEP index.html being an html resource, the way relative URLs are processed by web user agents is even clearer than json-ld processing: Document Base URL drives it.
@llemeurfr what I was worried about is to use an HTML parser by telling it, in some way or other, to use a specific and external base URL for which there is no standard. But, re-reading your comment, I realized that I did not understand what you meant by 'publication server'. Do you mean localhost or the cloud server used for unpacking? If that is the case, then you are right, it is not a problem.
Of course, for those cases, we do have the types of problems described in the Readium note. But I wonder whether this should not be the point where we simply acknowledge that we do not define a perfect packaging format but a lightweight one that does have its limitations (described in the note), and that the 'real' solution would be a future Web Packaging format that, somehow, would take care of maintaining the origin of the content.
@iherman by "publication server" I mean any piece of software capable of exposing dynamically a packaged publication (LPF or EPUB format) as a Web Publication. In Readium speak we call it a "streamer".
Do you mean localhost or the cloud server used for unpacking?
Yes, such a middleware can expose the Web Publication with a localhost origin or a "web" origin (domain name, IP address), depending on its usage (as part of a reading app or "on the web").
The problems exposed in the readium note have to do with the 'origin' of the Web Publication, not really its 'base URL' (and not the origin of the Packaged publication, as there is none); I was fooled by the title of this issue.
@llemeurfr
Optionally, in json-ld/json-ld.org#604, it seems that the JSON-LD WG has agreed that a @base property can override the default base URL inside the json structure, which mimics what exists with the <base> element in HTML documents. But let's keep that on the side for now.
That is correct, although I am not sure we should rely on a strictly 1.1 feature; at the moment, all our manifests are JSON-LD 1.0 compatible, and it would be fairly difficult to explain to the average users of our authored manifests what this would mean...
But already in JSON-LD 1.0 it was possible to use @base, i.e., the manifest author could do something like
"@context" : [
"https://schema.org",
"https://www.w3.org/ns/wp-context",
{ "@base": "https://example.org"}
]
...
(I have just checked and the structured data testing tool indicates that this is accepted and properly handled by at least that schema.org processor.)
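To show the effect of such a context entry, here is a deliberately simplified Python sketch. It is not a JSON-LD processor: the hypothetical helper `effective_base` only scans inline context objects for `@base`, and the `url` value `chapter1.html` is an assumed example.

```python
import json
from typing import Optional
from urllib.parse import urljoin


def effective_base(manifest: dict, document_url: str) -> Optional[str]:
    """Return the base to use for relative references, honoring an inline @base.

    Simplified: only inline context objects are inspected; remote contexts
    are not fetched. A null @base clears the base (returns None).
    """
    context = manifest.get("@context", [])
    entries = context if isinstance(context, list) else [context]
    base: Optional[str] = document_url
    for entry in entries:
        if isinstance(entry, dict) and "@base" in entry:
            value = entry["@base"]
            if value is None:
                base = None
            else:
                base = value if base is None else urljoin(base, value)
    return base


manifest = json.loads("""
{
  "@context": [
    "https://schema.org",
    "https://www.w3.org/ns/wp-context",
    {"@base": "https://example.org"}
  ],
  "url": "chapter1.html"
}
""")

base = effective_base(manifest, "https://domain.org:8080/pub_id/manifest.json")
print(base)                            # https://example.org
print(urljoin(base, manifest["url"]))  # https://example.org/chapter1.html
```

With the `@base` entry removed, the same call falls back to the URL the manifest was served from.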
I am not sure how that would solve the problem at hand, however, because the big issue with the origin is to ensure that the various scripts have the right origin URL when they, e.g., fetch external resources...
That being said, canonicalization should be able to handle @base, and currently this is not done; see the issue I raised earlier today: https://github.com/w3c/wpub/issues/434
The problems exposed in the readium note have to do with the 'origin' of the publication, not really the 'base URL'; I was fooled by the title of this issue.
Aren't we talking about the same set of problems? https://domain.org:8080/pub_id in your example is the base URL, yielding https://domain.org:8080 as the origin, so the problems in that note do apply...
@iherman @llemeurfr
For EPUB and LPF, there is truly no origin for these resources. To serve them, we have to adopt various strategies as described in the Readium document, but IMO these are technical implementation details rather than a true origin.