How should a resource in fetch.txt be content negotiated?
When resolving a URL in fetch.txt, you may get different results depending on content negotiation: the server may return a different representation (e.g. HTML instead of JSON) depending on the browser or client settings used to retrieve the resource. Obviously, if you get the "wrong one", the BagIt checksums will not match.
I think the specification should recognize this, and perhaps specify the default Accept headers to use, e.g.:
Accept-Language: *
Accept-Charset: *
Accept: application/octet-stream, */*;q=0.1
The headers Accept-Language and Accept-Charset may be excluded, as their default is *.
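As a rough, non-normative sketch of a client sending exactly those defaults (the URL and target path below are invented), whatever comes back is then simply the opaque payload to be checksummed:
import urllib.request

# Hypothetical fetch.txt entry:  http://example.com/data.json 512 data/data.json
req = urllib.request.Request(
    "http://example.com/data.json",
    headers={
        "Accept": "application/octet-stream, */*;q=0.1",
        "Accept-Language": "*",
        "Accept-Charset": "*",
    },
)
with urllib.request.urlopen(req) as resp:
    payload = resp.read()  # the opaque blob of bits that must match the manifest checksum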
An alternative would be to include an ETag for the particular rendering of the data object to which the checksum applies. Simply stating the accept type is not enough, because you could have different encodings of the octet stream. With this approach the line in fetch.txt would be
filename HTTP_URL ETAG
and a GET with the ETag and a strong validation requirement would ensure that you get the exact same encoding (and data) that the author intended. Of course this requires that we know the ETag for the rendering we want when we construct the reference, and that the server is properly implemented.
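A rough sketch of how a client might act on such a line (the URL, path, and ETag are made up; If-Match is the standard header for making a request conditional on a strong ETag):
import urllib.error
import urllib.request

# Hypothetical proposed line:  data/file.json http://example.com/resource "abc123"
req = urllib.request.Request(
    "http://example.com/resource",
    headers={"If-Match": '"abc123"'},  # strong ETag recorded when the bag was made
)
try:
    with urllib.request.urlopen(req) as resp:
        payload = resp.read()
except urllib.error.HTTPError as e:
    if e.code == 412:  # Precondition Failed: that exact representation is gone
        raise SystemExit("Server no longer offers the representation with this ETag")
    raise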
Hmm.. not so sure about ETag here; you can't GET a particular ETag, you can only use it for conditional methods with the If-Match header (or the inverse If-None-Match) - which would still require the Accept* headers to compare against the ETag of the correct representation.
As for knowing you have the correct representation after retrieval (depending on #8), you could just check the checksum in the manifest-*.txt for the remote resource.
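For example (a sketch only; the algorithm and file name are assumptions, any manifest-<alg>.txt from the bag would do):
import hashlib

def matches_manifest(payload: bytes, expected_sha256: str) -> bool:
    # payload: the body retrieved for a fetch.txt entry
    # expected_sha256: the checksum listed for the same path in manifest-sha256.txt
    return hashlib.sha256(payload).hexdigest() == expected_sha256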
ETags are, however, more powerful than such checksums, as a server can keep them stable across structurally or semantically equivalent representations rather than requiring byte-wise equivalence - e.g. a JSON representation that is constructed on the fly from an ElasticSearch instance.
There is nothing stopping a web-server from issuing a new ETag even when the content is byte-wise the same, e.g. because it is running on a newer version of Apache HTTP (indeed, some Apache installs used inode numbers and timestamps, so moving a file might change the ETag). Remember ETags are only intended for caching where the fallback is simply "just download it again".
This fits into #7 - are you meant to refresh a file from fetch.txt if it has changed on the server (new ETag) or has expired according to its cache headers?
Hi,
Yes, you are right.
I think I’ve been thinking about this wrong. The fetch.txt is just a mapping from a local name to an actionable URI. For the bag to be complete and valid, there must be some set of steps by which that actionable URI can be converted into a set of bits that have the correct checksum; however, the exact nature of those steps is out of scope for the spec, as it will depend on the nature of the URI and the services implemented to provide the contents. The only question is whether having the URI (and, I assume, the desired checksum) is enough information, or whether there will be cases in which additional hints that cannot be encoded in the URI are needed.
Carl
Yes, whenever a server provides a Vary header, you cannot reliably use that URL in fetch.txt - unless there is a Content-Location header, which you could use instead of the requested URL.
Anything doing non-RESTful stuff like authentication-dependent resources (e.g. http://example.com/me, which varies per user), cookies, etc. is also out.
I guess HTTP redirects are OK, but they could be a warning sign. Some Linked Data content negotiation results in a redirect based on your headers, after which you can download the final URL.
E.g. http://purl.uniprot.org/uniprot/M0R3D1 with Accept: text/turtle ultimately takes you to http://www.uniprot.org/uniprot/M0R3D1.ttl - clicking the first link in the browser instead shows a friendly HTML page, which you probably don't want in your Bag.
Semantically, the first is the protein identifier, which is what we really want to aggregate, while the second is the representation that you want in fetch.txt - at least as long as we don't have a way to specify the Accept header.
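To illustrate (only a sketch, using the UniProt example above): a client could send an explicit Accept header, let the redirects play out, and then record the final URL - or a Content-Location, if the server sends one - while treating a Vary header as a warning that the response was negotiated:
import urllib.request

req = urllib.request.Request(
    "http://purl.uniprot.org/uniprot/M0R3D1",
    headers={"Accept": "text/turtle"},
)
with urllib.request.urlopen(req) as resp:  # redirects are followed automatically
    final_url = resp.geturl()              # e.g. the URL of the .ttl representation
    vary = resp.headers.get("Vary")        # if present, the response was negotiated
    content_location = resp.headers.get("Content-Location")
    payload = resp.read()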
The fetch.txt was really more of a hack to facilitate the parallelized transfer of large bags using standardized tools, rather than something more exotic (Signiant, GridFTP, et al.) that we couldn't afford and/or comprehend. Created before the heady days of REST and content negotiation were really well understood (by me, anyway), there was an implicit expectation of accepting whatever the server decided to send down, since it was being treated as an opaque blob of bits anyway. The liberal Accept headers you suggest above, @stain, seem to me the closest approximation to that intention.
But really, this might just be considered a feature and left as a problem for humans to work out. That the fetch.txt is for transfers also implies some sort of collaboration or work going on between the parties in the ephemeral now. If you GET an asset and don't get the fixity you expect, you'll probably be emailing somebody on the other end to talk about it. Explicitly specifying the Accept headers doesn't actually do much to address the possibility of the server giving you back something you didn't expect.
No, but a piece of software that is trying to complete the bag does not have the luxury of contacting the person responsible. It can, however, speak the HTTP protocol, so I think we should have at least a minimal acknowledgement of this issue (perhaps written in a protocol-neutral way) so that different BagIt implementations handle it in somewhat similar ways.
I would agree with this. I've come to realize that in the HTTP space there really seems to be nothing you can do generically to force a server to give you what you want. You resolve the name and hope that the object is hosted by a service that offers bitwise perfection as part of its policy. Once you get the bits back, you do have the checksum, so you can verify whether or not you got the right bits. As you point out, this can get complicated, especially when we consider a variety of URIs such as DOIs, ARKs, or just plain old URLs. We might want to consider a couple of non-normative examples in the spec?
I too was hoping we could leverage the ETag field as @carlkesselman mentioned, but I think I may have to agree with @stain. Given the W3C specs as they are, then, it seems that we can't expect persistence and stability as anything other than a 'matter of service'. Ultimately, from the perspective of the end-to-end principle, this doesn't change the fact that (URL, checksum) pairs, or some other form of (location, identity) pairs, are what allow the consumer to determine whether they can find and verify the desired representation of the desired resource.
But still, in the interest of reducing some avoidable errors caused by content negotiation, it seems desirable to allow the client to make a more specific request within the bounds of the protocol it is using. Would it be reasonable to suggest that fetch.txt support:
filename URL [list of protocol-specific options]
where those options could include Accept headers in the case of HTTP URLs, while allowing for other protocol-specific options?
The current format is unfortunately not extensible in any way, as it is:
URL LENGTH FILENAME
with space-escaping etc. required for the URL, but not for FILENAME. Thus a valid line currently could be:
http://example.com/file.txt 512 data/folder with spaces/filename with spaces.txt
The only other possibility here (beyond minting magic URL schemes) is negative numbers below -1, e.g. -2 could mean "Should already exist, was downloaded from here", -3 could mean "Refresh from here" (as in #7), etc. This feels a lot like 1980s C programming, though.
Perhaps there could be optional indented lines below each entry? These could just be RFC822-style headers that are protocol-specific hints for the client. If the size is unknown, then - can be used instead of the number of bytes. For example:
http://example.com/file1.txt 512 data/no-special-headers.txt
http://example.com/file2 8192 data/negotiated.en.html
    Accept: text/html, application/xhtml+xml;q=0.9
    Accept-Language: en
http://example.com/file2 1024 data/negotiated.jsonld
    Accept: application/json
ftp://example.org/file3.txt - data/unknown-size.txt
gsiftp://example.net/file4.bam?cc=1;tcpbs=10M;P=4 17179869184 data/quite-large-over-gridftp.bam
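If the indented-header idea were adopted, parsing would stay simple. Here is a sketch (of this thread's proposal only, not anything in the current spec) where unindented lines start a new entry and indented lines attach protocol-specific options to the previous entry:
def parse_extended_fetch(lines):
    # Proposed format: URL LENGTH FILENAME, optionally followed by indented
    # RFC822-style option lines; "-" means the length is unknown.
    entries = []
    for line in lines:
        if not line.strip():
            continue
        if line[0] in (" ", "\t"):          # indented: an option for the previous entry
            name, _, value = line.strip().partition(":")
            entries[-1]["options"][name.strip()] = value.strip()
        else:
            url, length, filename = line.split(" ", 2)
            entries.append({
                "url": url,
                "length": None if length == "-" else int(length),
                "filename": filename,        # may itself contain spaces
                "options": {},
            })
    return entries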
I'm somewhat mixed on whether fetch.txt should be part of BagIt at all - it's fine for the simple cases, but the additional features I've seen requested quickly start to take on as much complexity as the entire BagIt spec.
Since BagIt allows arbitrary top-level tag files, in general I would pose the question of whether we should do anything other than plan to freeze/deprecate fetch.txt and encourage people to use something like Metalink (aka RFC 5854) with a well-known filename. That would give us, out of the box, a more complete spec and clients (e.g. curl) with support for things like mirroring, without duplicating that work in the BagIt world.
In this specific case, I'm also a little skeptical about using content negotiation in this context. The web as a whole has been slowly moving away from it due to the complexity cost, and in our context I'd worry that attempting to have bit-level fixity for negotiated content is going to be uncommon for that reason.