jwarc

Utility methods to read payload body

Open • sebastian-nagel opened this issue 4 years ago • 1 comment

Most consumers of the content payload require the payload to be

  1. decoded using the provided HTTP Content-Encoding
  2. available as byte[] (e.g. Tika) or even String (e.g. Jsoup)

I've found myself writing similar code whenever I consume the payload body of WarcResponse records: in jwarc's extract tool (#41), in a sitemap tester, and in StormCrawler. To make jwarc more usable, I'd propose bundling the following functionality in a few utility methods:

  • return the decoded payload body as channel using the HTTP Content-Encoding
    • with configurable behavior (fail or return the payload without decoding) when the Content-Encoding isn't understood or isn't reliable (e.g. gzip declared but no gzip magic/header present)
    • possibly allow passing decoders for encodings not supported by jwarc, e.g. brotli (I assume that jwarc is designed to have zero dependencies)
    • or should the decoding functionality be provided in a class HttpPayload extending WarcPayload?
  • read the (decoded) payload into byte[] (or ByteBuffer)
    • optionally limit the max. size of the byte[] array to ensure that oversized captures do not cause any issues
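
To make the proposal above concrete, here is a minimal sketch of the gzip case with a configurable failure mode. It uses only JDK classes and is not jwarc API: the class PayloadDecoding, the enum OnFailure and the method decode are hypothetical names chosen for illustration, and only gzip and identity are handled.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of jwarc.
final class PayloadDecoding {
    /** What to do when the Content-Encoding is unknown or does not match the data. */
    enum OnFailure { FAIL, PASS_THROUGH }

    static InputStream decode(InputStream raw, String contentEncoding, OnFailure onFailure)
            throws IOException {
        String encoding = contentEncoding == null
                ? "" : contentEncoding.trim().toLowerCase(Locale.ROOT);
        if (encoding.isEmpty() || encoding.equals("identity")) {
            return raw; // nothing to decode
        }
        if (encoding.equals("gzip") || encoding.equals("x-gzip")) {
            // Peek at the first two bytes to check for the gzip magic (0x1f 0x8b).
            BufferedInputStream in = new BufferedInputStream(raw);
            in.mark(2);
            int b1 = in.read();
            int b2 = in.read();
            in.reset();
            if (b1 == 0x1f && b2 == 0x8b) {
                return new GZIPInputStream(in);
            }
            if (onFailure == OnFailure.PASS_THROUGH) {
                return in; // header claims gzip, data is not: hand back the raw bytes
            }
            throw new IOException("Content-Encoding is gzip but the gzip magic is missing");
        }
        if (onFailure == OnFailure.PASS_THROUGH) {
            return raw; // encoding not understood: hand back the raw bytes
        }
        throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
    }
}
```

Returning an InputStream rather than a channel keeps the sketch short; a channel-based variant could wrap the result with java.nio.channels.Channels.newChannel.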

sebastian-nagel, Jun 21 '20 13:06

Having something like a decode() or bodyDecoded() convenience method on both HttpMessage and WarcPayload that decodes the content encoding seems reasonable to me.

record.payload().decode() -> MessageBody?
response.http().decode() -> MessageBody?

I think we could make brotli an optional maven dependency and if it's present on the classpath we use it.
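
One way the optional dependency could work is a reflective lookup, sketched below. This is an assumption about the approach, not jwarc code: the class OptionalBrotli and its methods are hypothetical, and the decoder class name org.brotli.dec.BrotliInputStream assumes the org.brotli:dec artifact is the one being used.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of jwarc.
final class OptionalBrotli {
    // Assumed decoder class; it comes from the optional org.brotli:dec artifact.
    private static final String BROTLI_CLASS = "org.brotli.dec.BrotliInputStream";

    /** Returns true if the brotli decoder class can be found on the classpath. */
    static boolean isAvailable() {
        try {
            Class.forName(BROTLI_CLASS, false, OptionalBrotli.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Wraps the stream in a brotli decoder, or fails if the decoder is absent. */
    static InputStream wrap(InputStream raw) throws IOException {
        try {
            Class<?> cls = Class.forName(BROTLI_CLASS);
            return (InputStream) cls.getConstructor(InputStream.class).newInstance(raw);
        } catch (ClassNotFoundException e) {
            throw new IOException("brotli decoder is not on the classpath", e);
        } catch (ReflectiveOperationException e) {
            throw new IOException("could not create brotli decoder", e);
        }
    }
}
```

Because the decoder is only ever referenced by name, the calling code compiles and loads without the brotli jar present.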

> read the (decoded) payload into byte[] (or ByteBuffer)

Note that from Java 9 onwards you can call body().stream().readAllBytes() and body().stream().readNBytes(buf, off, len). I'm not opposed to having our own, though, as there are still quite a few people targeting Java 8.
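
For Java 8 targets, a size-limited read could look roughly like the sketch below. The class BoundedRead and the method readUpTo are hypothetical names, not jwarc API, and the sketch assumes that silently truncating at the limit is acceptable; a real utility might prefer to signal truncation to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of jwarc; compiles on Java 8.
final class BoundedRead {
    /** Reads at most maxBytes from the stream; anything beyond the limit is left unread. */
    static byte[] readUpTo(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            // Never ask for more than the remaining budget.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```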

ato, Jun 22 '20 12:06