openwayback
Support ARC/WARC/.GZ type detection and loading
S'sheet line: 25 For whom? NLN Notes: Fixed recently? Test suite? in Heritrix Commons. Probably only GZIP sniffing.
As I recall, this means that when fetching (W)ARC[.gz] via HTTP, wayback should check the magic number at the beginning of the file to determine the type. Can somebody remind me of the use case here? Why doesn't using the URL file extension work (besides the fact that it is not a best practice) or using the Content-Type + Content-Encoding?
It's possible that there isn't an extension - in our case WARCs are retrieved from our store by passing a unique ID. Wayback's default behaviour is (or definitely used to be) to presume anything lacking an extension is an ARC. We end up suffixing a "#bogus=warc.gz" fragment just to get around this.
Certainly if the HTTP server is capable of setting the content-type correctly (i.e. "application/warc" - not sure what ARCs would be?) OpenWayback should take advantage of this, which I don't think it currently does. However, in the case that the server simply gives "text/plain", "application/gzip" or good ol' "application/octet-stream", ideally OpenWayback should be able to accurately determine the type.
Sounds like someone needs to write up a proposed standardization of content-type responses for ARC/WARC with and without compression.
Thanks for the info. This all sounds a bit tricky. If we do implement sniffing, I wonder if there is any disadvantage to always using it. Otherwise we will have to decide which is preferred: mimetype, file extension, or sniffing.
I was originally interested in resolving this for ARC/WARC source files, not over HTTP. The current codebase uses the file extension, but makes different assumptions in different places (there are at least two different content-guesser implementations). My 'wayback-player' app can't cope with uncompressed archives, because in that context the codebase always assumes compression is present. In other cases, as @PsypherPunk said, we have to hack a 'file extension' to make Wayback happy because archives without extensions are assumed to be compressed ARC.
I've implemented ARC/WARC sniffing via Apache Tika before, and it would probably make sense just to reuse that for uncompressed data, and use a two-pass detector to cope with GZipped files.
I've been using application/warc and application/x-internet-archive because they appeared to be the most common forms. It's not clear how to indicate block gzip; maybe something like application/warc; encoding=concat-gzip would be useful?
Since we can sniff gzip by checking the first 2 bytes of the file, I think that might be easier than pulling in Tika. Distinguishing between WARC and ARC would be pretty easy after that.
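As a rough illustration of the two-step check described above, here is a minimal sketch, not OpenWayback's actual code. The class name `ArchiveSniffer` and the `Format` enum are hypothetical. It assumes the gzip magic bytes 0x1f 0x8b, that WARC records start with `WARC/`, and that ARC files start with a `filedesc://` version block. For gzipped input it inflates just enough of the first member to look at the inner magic, so the caller needs to pass enough leading bytes for that to succeed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: classifies the leading bytes of a file.
final class ArchiveSniffer {

    enum Format { WARC, ARC, WARC_GZ, ARC_GZ, UNKNOWN }

    // Assumption: WARC records begin with a version line such as "WARC/1.0".
    private static final byte[] WARC_MAGIC = "WARC/".getBytes(StandardCharsets.US_ASCII);
    // Assumption: ARC files begin with a "filedesc://" version block.
    private static final byte[] ARC_MAGIC = "filedesc://".getBytes(StandardCharsets.US_ASCII);

    // The gzip magic number is the two bytes 0x1f 0x8b.
    static boolean isGzip(byte[] head) {
        return head.length >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b;
    }

    // Classify the head of a file; for gzip, peek inside the first member.
    static Format sniff(byte[] head) {
        if (isGzip(head)) {
            byte[] inner = inflatePrefix(head, ARC_MAGIC.length);
            if (startsWith(inner, WARC_MAGIC)) return Format.WARC_GZ;
            if (startsWith(inner, ARC_MAGIC)) return Format.ARC_GZ;
            return Format.UNKNOWN;
        }
        if (startsWith(head, WARC_MAGIC)) return Format.WARC;
        if (startsWith(head, ARC_MAGIC)) return Format.ARC;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Inflate at most 'want' bytes; returns fewer if the input is short or corrupt.
    private static byte[] inflatePrefix(byte[] head, int want) {
        byte[] out = new byte[want];
        int n = 0;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(head))) {
            while (n < want) {
                int r = in.read(out, n, want - n);
                if (r < 0) break;
                n += r;
            }
        } catch (IOException e) {
            // Truncated or corrupt gzip data: keep whatever was decoded so far.
        }
        return Arrays.copyOf(out, n);
    }
}
```

A caller would read the first few hundred bytes of the resource and pass them to `ArchiveSniffer.sniff(head)`; an `UNKNOWN` result would leave the decision to whatever fallback (extension, Content-Type) is in place.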
Maybe application/warc+gzip?
The problem is I think we should distinguish between concatenated gzip and plain gzip, and deliberately use an unfamiliar identifier so that users are aware of this distinction. Perhaps that's overkill. Certainly, application/warc+gzip would not be as easily misinterpreted as application/warc; encoding=gzip.
Yes, sure, we can just reimplement the sniffing logic. I get a bit tired of messing about with the necessary buffering and resetting of input streams etc., so lean towards re-using existing implementations. However, I guess this specific case is simple enough - unlike my other use cases elsewhere, the first few bytes will definitely be sufficient, so a small fixed-size buffer will be ok.
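The buffering and resetting mentioned above can be kept small with the standard `mark`/`reset` mechanism. The following is a minimal sketch under that assumption; `HeadPeeker`, `markable`, `peek` and `HEAD_SIZE` are hypothetical names, not existing OpenWayback or Heritrix APIs. It reads a fixed number of leading bytes and then rewinds the stream, so the real reader still sees the file from the first byte.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper: look at the first bytes of a stream without consuming them.
final class HeadPeeker {

    // Illustrative default; the magic numbers only need the first few bytes.
    static final int HEAD_SIZE = 512;

    // Wrap the stream only if it cannot already mark/reset.
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in, HEAD_SIZE);
    }

    // Read up to 'limit' bytes, then rewind so the caller sees the stream from the start.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset; wrap it with markable()");
        }
        in.mark(limit);
        try {
            byte[] buf = new byte[limit];
            int n = 0;
            while (n < limit) {
                int r = in.read(buf, n, limit - n);
                if (r < 0) break; // stream shorter than 'limit'
                n += r;
            }
            return Arrays.copyOf(buf, n);
        } finally {
            in.reset();
        }
    }
}
```

The returned bytes can be handed to whichever magic-number check is used, and the same (wrapped) stream is then passed on to the ARC or WARC reader.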
Well, concatenated gzip shouldn't matter to an end user, it only matters to people creating (W)ARC files, so I don't know if it is important to report it.
I don't think a mime parameter is the proper place to specify that the file is gzipped. We could use the Content-Encoding header if it doesn't cause any issues.
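To make the header-based option concrete, here is a hedged sketch of how the Content-Type and Content-Encoding pair might be mapped to a format hint. Nothing here is standardized or part of OpenWayback: the class `HeaderHints` and the `Hint` enum are hypothetical, and the two MIME types are simply the ones mentioned earlier in this thread. Generic types such as application/octet-stream yield `UNKNOWN`, which is the case where sniffing would still be needed.

```java
import java.util.Locale;

// Hypothetical helper: derive a format hint from HTTP response headers.
final class HeaderHints {

    enum Hint { WARC, WARC_GZ, ARC, ARC_GZ, UNKNOWN }

    // Either argument may be null when the header is absent.
    static Hint fromHeaders(String contentType, String contentEncoding) {
        String type = baseType(contentType);
        boolean gzip = contentEncoding != null
                && contentEncoding.trim().equalsIgnoreCase("gzip");
        // MIME types as used in this thread; neither is assumed to be registered.
        if (type.equals("application/warc")) {
            return gzip ? Hint.WARC_GZ : Hint.WARC;
        }
        if (type.equals("application/x-internet-archive")) {
            return gzip ? Hint.ARC_GZ : Hint.ARC;
        }
        // text/plain, application/gzip, application/octet-stream, etc.
        return Hint.UNKNOWN;
    }

    // Strip any parameters (";charset=...") and normalize case.
    private static String baseType(String contentType) {
        if (contentType == null) return "";
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

Whether such a hint should override, or be overridden by, the magic-number check is exactly the open question in this thread; the sketch takes no position on that.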
Via @ikreymer I found out that JWAT has WarcReaderFactory.isWarcRecord(), ArcReaderFactory.isArcRecord() and GzipReader.isGzipped(), so it might be possible to clean up the current duplicated sniffing code and use them instead.
Webrecorder now writes compressed WARCs without a .gz extension, so this is one more reason to address this issue. Pull requests are welcome!