jwarc
Utility methods to read payload body
Most consumers of the content payload require the payload to be:
- decoded according to the HTTP Content-Encoding header
- available as byte[] (e.g. Tika) or even String (e.g. Jsoup)
I've found myself writing similar code when consuming the payload body of WarcResponse records: jwarc's extract tool #41, a sitemap tester and StormCrawler. To make jwarc more usable, I propose bundling the following functionality in a few utility methods:
- return the decoded payload body as a channel, using the HTTP Content-Encoding
  - with configurable behavior (fail or return the payload without decoding) when the Content-Encoding isn't understood or is not reliable (gzip without gzip magic/header)
  - possibly make it possible to pass decoders for encodings not supported by jwarc, e.g. brotli (I assume that jwarc is designed to have zero dependencies)
  - or should the decoding functionality be provided in a class HttpPayload extending WarcPayload?
- read the (decoded) payload into byte[] (or ByteBuffer)
  - optionally limit the max. size of the byte[] array to ensure that oversized captures do not cause any issues
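To make the proposed decoding behavior concrete, here is a minimal sketch using only the JDK. It is not jwarc code: the class PayloadDecoding, the enum OnError and the method decode are hypothetical names, and it works on InputStream rather than a channel (java.nio.channels.Channels.newChannel could adapt the result). Only gzip is handled; anything else either fails or is passed through, depending on the chosen policy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper for illustration, not part of jwarc. */
final class PayloadDecoding {
    /** Policy for an unknown or unreliable Content-Encoding. */
    enum OnError { FAIL, PASS_THROUGH }

    /** Returns a stream over the decoded body, or the raw body per the policy. */
    static InputStream decode(InputStream body, String contentEncoding, OnError onError)
            throws IOException {
        String enc = contentEncoding == null
                ? "" : contentEncoding.trim().toLowerCase(Locale.ROOT);
        if (enc.isEmpty() || enc.equals("identity")) {
            return body; // nothing to decode
        }
        if (enc.equals("gzip") || enc.equals("x-gzip")) {
            // Peek at the first two bytes to verify the gzip magic (0x1f 0x8b).
            PushbackInputStream in = new PushbackInputStream(body, 2);
            byte[] magic = new byte[2];
            int n = readFully(in, magic);
            in.unread(magic, 0, n);
            boolean looksGzipped = n == 2
                    && (magic[0] & 0xff) == 0x1f && (magic[1] & 0xff) == 0x8b;
            if (looksGzipped) {
                return new GZIPInputStream(in);
            }
            if (onError == OnError.FAIL) {
                throw new IOException("Content-Encoding is gzip but body has no gzip magic");
            }
            return in; // header is unreliable: return the payload undecoded
        }
        if (onError == OnError.FAIL) {
            throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
        }
        return body; // unknown encoding: return the payload undecoded
    }

    /** Reads until the buffer is full or EOF; returns the number of bytes read. */
    private static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) break;
            total += n;
        }
        return total;
    }
}
```

A caller that prefers best-effort extraction would pass OnError.PASS_THROUGH, while a validator would pass OnError.FAIL.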
Having something like a decode() or bodyDecoded() convenience method on both HttpMessage and WarcPayload that decodes the content encoding seems reasonable to me.
record.payload().decode() -> MessageBody?
response.http().decode() -> MessageBody?
I think we could make brotli an optional Maven dependency and use it if it's present on the classpath.
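One way the classpath check could work is to look the decoder class up reflectively, so there is no compile-time dependency. The sketch below is only an illustration: OptionalCodecs and wrapIfPresent are hypothetical names, and the brotli class name is an assumption (the stream class of the org.brotli:dec decoder), not something jwarc defines.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Optional;

/** Hypothetical helper for illustration, not part of jwarc. */
final class OptionalCodecs {
    // Assumption: decoder stream class shipped by the org.brotli:dec artifact.
    static final String BROTLI_STREAM_CLASS = "org.brotli.dec.BrotliInputStream";

    /**
     * Wraps the stream with the named InputStream class if it is on the
     * classpath; returns empty if the class cannot be found.
     */
    static Optional<InputStream> wrapIfPresent(String className, InputStream in)
            throws IOException {
        try {
            Class<?> cls = Class.forName(className);
            // Expects a public constructor taking a single InputStream.
            Object decoder = cls.getConstructor(InputStream.class).newInstance(in);
            return Optional.of((InputStream) decoder);
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return Optional.empty(); // optional dependency is absent
        } catch (ReflectiveOperationException e) {
            throw new IOException("Could not instantiate " + className, e);
        }
    }
}
```

A caller would try wrapIfPresent(OptionalCodecs.BROTLI_STREAM_CLASS, body) and fall back to the fail or pass-through behavior when the result is empty.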
read the (decoded) payload into byte[] (or ByteBuffer)
Note that from Java 9 you can do body().stream().readAllBytes() and body().stream().readNBytes(buf, off, len). I'm not opposed to having our own, though, as there are still quite a few people targeting Java 8.
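For Java 8 targets, a size-limited read is only a few lines. The following is a rough sketch, with BoundedReads and readUpTo as hypothetical names. It truncates silently at the limit; a real utility might instead want to report that the capture was oversized.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration, not part of jwarc. */
final class BoundedReads {
    /** Reads at most maxBytes from the stream (Java 8 compatible). */
    static byte[] readUpTo(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never request more than the remaining budget.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) break; // end of stream
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```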