openwayback

Support ARC/WARC/.GZ type detection and loading

Open PsypherPunk opened this issue 11 years ago • 11 comments

Spreadsheet line: 25
For whom? NLN
Notes: Fixed recently? Test suite? in Heritrix Commons. Probably only GZIP sniffing.

PsypherPunk, Dec 18 '13 10:12

As I recall, this means that when fetching (W)ARC[.gz] via HTTP, wayback should check the magic number at the beginning of the file to determine the type. Can somebody remind me of the use case here? Why doesn't using the URL file extension work (besides the fact that it is not a best practice) or using the Content-Type + Content-Encoding?

egh, Jan 09 '14 22:01

It's possible that there isn't an extension - in our case WARCs are retrieved from our store by passing a unique ID. Wayback's default behaviour is (or definitely used to be) to presume anything lacking an extension is an ARC. We end up suffixing a "#bogus=warc.gz" fragment just to get around this.

Certainly if the HTTP server is capable of setting the content-type correctly (i.e. "application/warc" - not sure what ARCs would be?) OpenWayback should take advantage of this, which I don't think it currently does. However, in the case that the server simply gives "text/plain", "application/gzip" or good ol' "application/octet-stream", ideally OpenWayback should be able to accurately determine the type.

PsypherPunk, Jan 10 '14 10:01

Sounds like someone needs to write up a proposed standardization of content-type responses for ARC/WARC with and without compression.

kris-sigur, Jan 10 '14 10:01

Thanks for the info. This all sounds a bit tricky. If we do implement sniffing, I wonder if there is any disadvantage to always using it. Otherwise we will have to decide what is preferred: mimetype, file extension, or sniffing.

egh, Jan 10 '14 19:01
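
One way to picture the precedence question above is the sketch below. It assumes, purely for illustration, the order "specific Content-Type, then file extension, then sniffing"; the thread does not settle on any order. The class and method names (FormatResolver, resolve) are hypothetical, and the two media types are the ones mentioned elsewhere in this thread rather than registered standards. It resolves only the base format and ignores compression.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical resolver, not OpenWayback code: picks a base format from three hints. */
final class FormatResolver {
    private FormatResolver() {}

    /** Trusts only media types that name the format; generic ones yield nothing. */
    static Optional<String> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Drop any parameters such as "; charset=..." before comparing.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (base) {
            case "application/warc":
                return Optional.of("warc");
            case "application/x-internet-archive":
                return Optional.of("arc");
            default:
                // text/plain, application/gzip, application/octet-stream, ...
                return Optional.empty();
        }
    }

    /** Looks at the name's extension, ignoring a trailing ".gz". */
    static Optional<String> fromExtension(String name) {
        if (name == null) {
            return Optional.empty();
        }
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gz")) {
            lower = lower.substring(0, lower.length() - 3);
        }
        if (lower.endsWith(".warc")) {
            return Optional.of("warc");
        }
        if (lower.endsWith(".arc")) {
            return Optional.of("arc");
        }
        return Optional.empty();
    }

    /** Assumed order: Content-Type, then extension, then the sniffed result. */
    static String resolve(String contentType, String name, String sniffed) {
        return fromContentType(contentType)
                .orElseGet(() -> fromExtension(name).orElse(sniffed));
    }
}
```

With this ordering, a store that serves WARCs by unique ID as application/octet-stream falls through to the sniffed value. Always sniffing would amount to calling only the last step.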

I was originally interested in resolving this for ARC/WARC source files, not over HTTP. The current codebase uses the file extension, but makes different assumptions in different places (there are at least two different content-guesser implementations). My 'wayback-player' app can't cope with uncompressed archives, because in that context the codebase always assumes compression is present. In other cases, as @PsypherPunk said, we have to hack a 'file extension' to make Wayback happy because archives without extensions are assumed to be compressed ARC.

I've implemented ARC/WARC sniffing via Apache Tika before, and it would probably make sense just to reuse that for uncompressed data, and use a two-pass detector to cope with GZipped files.

I've been using application/warc and application/x-internet-archive because they appeared to be the most common forms. It's not clear how to indicate block gzip; maybe something like application/warc; encoding=concat-gzip would be useful?

anjackson, Jan 10 '14 20:01

Since we can sniff gzip by checking the first 2 bytes of the file, I think that might be easier than pulling in Tika. Distinguishing between WARC and ARC would be pretty easy after that.

egh, Jan 10 '14 21:01
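
A minimal sketch of the two-byte check described above. It assumes the usual signatures: gzip streams begin with the bytes 0x1f 0x8b, WARC records begin with the ASCII text "WARC/", and ARC files begin with a "filedesc://" version block. The names MagicSniffer and sniff are hypothetical and not part of OpenWayback.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not OpenWayback code: classifies data by its leading bytes. */
final class MagicSniffer {
    enum Kind { GZIP, WARC, ARC, UNKNOWN }

    private static final byte[] WARC_MAGIC = "WARC/".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ARC_MAGIC = "filedesc://".getBytes(StandardCharsets.US_ASCII);

    private MagicSniffer() {}

    /** Inspects only the start of {@code head}; a dozen or so bytes are enough. */
    static Kind sniff(byte[] head) {
        // gzip: ID1 = 0x1f, ID2 = 0x8b (mask to compare as unsigned values)
        if (head.length >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
            return Kind.GZIP;
        }
        if (startsWith(head, WARC_MAGIC)) {
            return Kind.WARC;
        }
        if (startsWith(head, ARC_MAGIC)) {
            return Kind.ARC;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A GZIP result says nothing yet about whether the payload is WARC or ARC; that needs the start of the stream to be decompressed and checked again.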

Maybe application/warc+gzip?

egh, Jan 10 '14 21:01

The problem is I think we should distinguish between concatenated gzip and plain gzip, and deliberately use an unfamiliar identifier so that users are aware of this distinction. Perhaps that's overkill. Certainly, application/warc+gzip would not be as easily mis-interpreted as application/warc; encoding=gzip.

Yes, sure, we can just reimplement the sniffing logic. I get a bit tired of messing about with the necessary buffering and resetting of input streams etc., so lean towards re-using existing implementations. However, I guess this specific case is simple enough - unlike my other use cases elsewhere, the first few bytes will definitely be sufficient, so a small fixed-size buffer will be ok.

anjackson, Jan 10 '14 22:01
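
The buffering and two-pass idea discussed above could look roughly like the sketch below. It peeks at a small fixed-size window using mark/reset, and if the window looks like gzip it inflates the first few bytes and sniffs again. TwoPassDetector, the 512-byte window and the string labels are all assumptions made for illustration; none of this is existing OpenWayback or Tika code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

/** Hypothetical two-pass detector; not OpenWayback code. */
final class TwoPassDetector {
    private static final int WINDOW = 512; // bytes to peek at (assumed to be enough)
    private static final int INNER = 16;   // decompressed bytes needed for the second pass

    private TwoPassDetector() {}

    /** Returns "warc", "arc", "warc.gz", "arc.gz" or "unknown", then rewinds the stream. */
    static String detect(BufferedInputStream in) throws IOException {
        in.mark(WINDOW);
        byte[] head;
        try {
            head = readUpTo(in, WINDOW);
        } finally {
            in.reset(); // leave the caller's stream positioned at its first byte
        }
        boolean gzip = head.length >= 2
                && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b;
        if (!gzip) {
            return classify(head);
        }
        // Second pass: inflate the start of the first gzip member and sniff again.
        try (InputStream gz = new GZIPInputStream(new ByteArrayInputStream(head))) {
            String inner = classify(readUpTo(gz, INNER));
            return inner.equals("unknown") ? inner : inner + ".gz";
        } catch (IOException e) {
            return "unknown"; // gzip magic present, but the window could not be inflated
        }
    }

    private static String classify(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        if (text.startsWith("WARC/")) {
            return "warc";
        }
        if (text.startsWith("filedesc://")) {
            return "arc";
        }
        return "unknown";
    }

    /** Reads until {@code max} bytes are collected or the stream ends. */
    private static byte[] readUpTo(InputStream in, int max) throws IOException {
        byte[] buf = new byte[max];
        int total = 0;
        while (total < max) {
            int n = in.read(buf, total, max - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }
}
```

Because the window is copied into a byte array before inflating, the only state touched on the caller's stream is the single mark/reset pair, which keeps the stream handling small.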

Well, concatenated gzip shouldn't matter to an end user, it only matters to people creating (W)ARC files, so I don't know if it is important to report it.

I don't think a MIME parameter is the proper place to specify that the file is gzipped. We could use the Content-Encoding header if it doesn't cause any issues.

egh, Jan 10 '14 22:01

Via @ikreymer I found out that JWAT has WarcReaderFactory.isWarcRecord(), ArcReaderFactory.isArcRecord() and GzipReader.isGzipped(), so it might be possible to clean up the current duplicated sniffing code and use them instead.

anjackson, Feb 12 '14 13:02

Webrecorder now writes compressed WARCs without a .gz extension, so this is one more reason to address this issue. Pull requests are welcome!

ldko, Aug 13 '19 14:08