jwarc
Java library for reading and writing WARC files with a typed API
This is a very minor API issue I ran into. I wanted to define a custom record type instead of tacking weird nonstandard semantics onto revisits. This proved difficult since...
It would be nice to have an option to output in CDXJ format. Pywb's cdx-indexer uses the command-line option "-j, --cdxj" for that so it'd be nice if we support...
Hi, I am trying to record a WARC with jwarc in proxy mode, and every browser I use fails. To run jwarc in proxy mode I used these commands: ```shell export...
Most consumers of the content payload require the payload to be (1) decoded using the provided HTTP Content-Encoding and (2) available as `byte[]` (e.g. Tika) or even `String` (e.g. Jsoup). I've...
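The two requirements above can be illustrated with a minimal sketch that uses only the JDK. `PayloadDecoder` and its methods are hypothetical names, not part of jwarc's API, and the sketch assumes the Content-Encoding value has already been read from the HTTP headers. Only `gzip`, `deflate`, and `identity` are handled here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration; not a jwarc class.
final class PayloadDecoder {
    /** Wraps the raw body in a decoding stream chosen by the Content-Encoding value (null means none). */
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) return raw;
        switch (contentEncoding.trim().toLowerCase(Locale.ROOT)) {
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                // Assumes zlib-wrapped deflate; some servers send raw deflate instead.
                return new InflaterInputStream(raw);
            case "":
            case "identity":
                return raw;
            default:
                throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
        }
    }

    /** Decodes the whole body into a byte[] (the form Tika-style consumers expect). */
    static byte[] decodeToBytes(byte[] rawBody, String contentEncoding) {
        try (InputStream in = decode(new ByteArrayInputStream(rawBody), contentEncoding)) {
            return in.readAllBytes(); // available since Java 9
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo-only helper that produces a gzip-encoded body. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = gzip("<html>hi</html>".getBytes(StandardCharsets.UTF_8));
        // A String for Jsoup-style consumers; the charset here is an assumption.
        String text = new String(decodeToBytes(body, "gzip"), StandardCharsets.UTF_8);
        System.out.println(text);
    }
}
```

Reading the whole payload into memory is convenient but can be costly for large records, so a streaming variant built on `decode` may be preferable in practice.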
It'd be nice to make use of the Java 11 versions of inflate() and deflate() so that buffers that aren't array-backed can be used. One option would be to produce...
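As a rough illustration of the Java 11 overloads mentioned above, the following sketch round-trips data through direct (non-array-backed) buffers using `Deflater.setInput(ByteBuffer)`, `Deflater.deflate(ByteBuffer)`, `Inflater.setInput(ByteBuffer)`, and `Inflater.inflate(ByteBuffer)`. The class name `DirectBufferZlib` and the buffer-growing strategy are assumptions made for the example, not jwarc code.

```java
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration; requires Java 11 or later.
final class DirectBufferZlib {
    /** Compresses the remaining bytes of input into a new direct buffer. */
    static ByteBuffer deflate(ByteBuffer input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input); // accepts direct buffers since Java 11
            deflater.finish();
            ByteBuffer out = ByteBuffer.allocateDirect(Math.max(64, input.remaining()));
            while (!deflater.finished()) {
                if (!out.hasRemaining()) out = grow(out);
                deflater.deflate(out);
            }
            out.flip();
            return out;
        } finally {
            deflater.end(); // release native zlib state
        }
    }

    /** Decompresses a complete zlib stream into a new direct buffer. */
    static ByteBuffer inflate(ByteBuffer input) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(input);
            ByteBuffer out = ByteBuffer.allocateDirect(Math.max(64, input.remaining() * 2));
            while (!inflater.finished()) {
                if (!out.hasRemaining()) out = grow(out);
                int n = inflater.inflate(out);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or dictionary-dependent stream");
                }
            }
            out.flip();
            return out;
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Corrupt zlib stream", e);
        } finally {
            inflater.end();
        }
    }

    /** Copies a full buffer into a direct buffer of twice the capacity. */
    private static ByteBuffer grow(ByteBuffer full) {
        ByteBuffer bigger = ByteBuffer.allocateDirect(full.capacity() * 2);
        full.flip();
        bigger.put(full);
        return bigger;
    }
}
```

On Java 8 the same operations need `byte[]` arguments, which forces a copy whenever the source buffer is direct or memory-mapped.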
The [ClueWeb09](http://www.lemurproject.org/clueweb09/datasetInformation.php) dataset WARC files (see [sample files](http://www.lemurproject.org/clueweb09/sampleFiles.php)) use a single line feed `\n` as separator between WARC headers. The WarcParser expects `\r\n` (which would conform to the standard) and...
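One way to tolerate both separators is sketched below. It is a simplified string-based illustration, not how jwarc's `WarcParser` is implemented, and `LenientHeaderParser` is a hypothetical name. It ignores folded (continuation) header lines and skips the version line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not a jwarc class.
final class LenientHeaderParser {
    /** Parses "Name: value" lines terminated by either "\r\n" or a bare "\n". */
    static Map<String, String> parse(String headerBlock) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : headerBlock.split("\r?\n")) {
            if (line.isEmpty()) break;            // a blank line ends the header block
            int colon = line.indexOf(':');
            if (colon <= 0) continue;             // skips the version line and malformed lines
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }

    public static void main(String[] args) {
        String lfOnly = "WARC/0.18\nWARC-Type: response\nContent-Length: 42\n\n";
        System.out.println(parse(lfOnly));
    }
}
```

A strict mode could still be kept alongside this, so that nonconforming files are only accepted when the caller opts in.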
WARC writers may provide a [gzip extra field](http://zlib.org/rfc-gzip.html#extra) "sl" (recommended by [WARC 0.9](http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html#anchor25) but dropped in newer versions) to encode the length of the compressed WARC record. This can be...
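For orientation, a gzip extra field is a sequence of subfields, each with a two-byte identifier and a little-endian two-byte length, as laid out in the gzip RFC linked above. The sketch below locates a subfield such as "sl" in a gzip member header and returns its raw bytes. `GzipExtraField` is a hypothetical helper, and the sketch deliberately does not interpret the subfield's contents.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper for illustration; not a jwarc class.
final class GzipExtraField {
    private static final int FEXTRA = 0x04; // FLG bit that signals an extra field

    /** Returns the data of subfield (si1, si2) from a gzip member header, or null if absent. */
    static byte[] find(byte[] gzipHeader, char si1, char si2) {
        if (gzipHeader.length < 12) return null;
        ByteBuffer buf = ByteBuffer.wrap(gzipHeader).order(ByteOrder.LITTLE_ENDIAN);
        // Magic bytes ID1 = 0x1f, ID2 = 0x8b identify a gzip member.
        if ((buf.get(0) & 0xff) != 0x1f || (buf.get(1) & 0xff) != 0x8b) return null;
        int flags = buf.get(3) & 0xff;
        if ((flags & FEXTRA) == 0) return null;
        // XLEN follows the fixed 10-byte header.
        int xlen = buf.getShort(10) & 0xffff;
        int pos = 12;
        int end = Math.min(pos + xlen, gzipHeader.length);
        while (pos + 4 <= end) {
            int len = buf.getShort(pos + 2) & 0xffff;
            int dataStart = pos + 4;
            if (dataStart + len > end) return null; // truncated subfield
            if (gzipHeader[pos] == (byte) si1 && gzipHeader[pos + 1] == (byte) si2) {
                return Arrays.copyOfRange(gzipHeader, dataStart, dataStart + len);
            }
            pos = dataStart + len; // advance to the next subfield
        }
        return null;
    }
}
```

A call such as `GzipExtraField.find(headerBytes, 's', 'l')` would return the subfield payload when a writer included it, and `null` otherwise.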
I am wondering whether it would be helpful to add a Dockerfile to the repo that includes Chromium/Google Chrome and the other runtime requirements, so that all the tools function as expected.
I noticed that replaying WARCs provides a 14-digit datetime placeholder. As I anticipate this will eventually be semantic, it need not necessarily be. However, providing Memento ([RFC7089](https://tools.ietf.org/html/rfc7089)) HTTP response headers...
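As a small illustration of the Memento side, a 14-digit timestamp can be turned into the value of a `Memento-Datetime` response header, which RFC 7089 expresses as an HTTP date in GMT. The sketch assumes the 14 digits are UTC in `yyyyMMddHHmmss` form. `MementoHeaders` is a hypothetical name, not part of jwarc.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for illustration; not a jwarc class.
final class MementoHeaders {
    // 14-digit replay timestamp, assumed to be UTC.
    private static final DateTimeFormatter TIMESTAMP_14 =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // HTTP date with a zero-padded day, always rendered in English and GMT.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH);

    /** Converts a 14-digit timestamp into a Memento-Datetime header value. */
    static String mementoDatetime(String timestamp14) {
        LocalDateTime utc = LocalDateTime.parse(timestamp14, TIMESTAMP_14);
        return HTTP_DATE.format(utc.atOffset(ZoneOffset.UTC));
    }

    public static void main(String[] args) {
        System.out.println("Memento-Datetime: " + mementoDatetime("20070531203500"));
    }
}
```

A full Memento implementation would also need `Link` headers pointing at the original resource and the TimeGate or TimeMap, which this sketch leaves out.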
`jwarc filter resource | jwarc exec file` `jwarc filter image | jwarc exec montage`