jwarc

Java library for reading and writing WARC files with a typed API

Results 13 jwarc issues

This is a very minor API issue I ran into. I wanted to define a custom record type instead of tacking weird nonstandard semantics onto revisits. This proved difficult since...

It would be nice to have an option to output in CDXJ format. Pywb's cdx-indexer uses the command-line option "-j, --cdxj" for that so it'd be nice if we support...

enhancement
pull request welcome

Hi, I am trying to record a WARC with jwarc in proxy mode, and every browser I use fails. To run jwarc in proxy mode I used these commands: ```shell export...

Most consumers of the content payload require the payload to be (1) decoded using the provided HTTP Content-Encoding and (2) available as byte[] (e.g. Tika) or even String (e.g. Jsoup). I've...
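The two requirements in this request can be illustrated with a minimal standalone sketch that uses only the JDK. It is not jwarc's API: the class name `PayloadDecoder` and its methods are hypothetical, the Content-Encoding value is assumed to have been read from the HTTP headers by the caller, and only `gzip`, `deflate` (assumed to be zlib-wrapped) and `identity` are handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of jwarc.
class PayloadDecoder {
    /** Wraps the raw payload stream according to the Content-Encoding value (may be null). */
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) return raw;
        switch (contentEncoding.trim().toLowerCase(Locale.ROOT)) {
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                return new InflaterInputStream(raw); // assumes zlib-wrapped deflate
            case "identity":
            case "":
                return raw;
            default:
                throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
        }
    }

    /** Decodes the whole payload into memory, e.g. for handing to a parser that wants byte[]. */
    static byte[] decodeToBytes(InputStream raw, String contentEncoding) throws IOException {
        try (InputStream in = decode(raw, contentEncoding)) {
            return in.readAllBytes(); // Java 9+
        }
    }

    /** Decodes the payload to a String; the caller supplies the charset. */
    static String decodeToString(InputStream raw, String contentEncoding, Charset charset)
            throws IOException {
        return new String(decodeToBytes(raw, contentEncoding), charset);
    }
}
```

Reading the whole payload into memory is only reasonable for small bodies; a real implementation would probably want a size limit.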

It'd be nice to make use of the Java 11 versions of inflate() and deflate() so that buffers that aren't array-backed can be used. One option would be to produce...
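For context, the Java 11 overloads referred to here are `Inflater.setInput(ByteBuffer)` and `Inflater.inflate(ByteBuffer)`, which accept direct (non-array-backed) buffers. The following is a minimal sketch of how they can be used, not a proposal for how jwarc should integrate them; the class name `DirectBufferInflate` and the `maxOutput` parameter are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of jwarc. Requires Java 11+.
class DirectBufferInflate {
    /**
     * Inflates zlib-wrapped data from any ByteBuffer (direct or heap) into a
     * direct buffer of at most maxOutput bytes.
     */
    static ByteBuffer inflate(ByteBuffer compressed, int maxOutput) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed); // Java 11+: no backing array needed
            ByteBuffer out = ByteBuffer.allocateDirect(maxOutput);
            while (!inflater.finished() && out.hasRemaining()) {
                int n = inflater.inflate(out); // Java 11+: writes at out's position
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input, or a preset dictionary is required
                }
            }
            if (!inflater.finished()) {
                throw new DataFormatException("Input truncated or output larger than " + maxOutput);
            }
            out.flip(); // prepare the buffer for reading
            return out;
        } finally {
            inflater.end(); // release native zlib resources
        }
    }
}
```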

The [ClueWeb09](http://www.lemurproject.org/clueweb09/datasetInformation.php) dataset WARC files (see [sample files](http://www.lemurproject.org/clueweb09/sampleFiles.php)) use a single line feed `\n` as separator between WARC headers. The WarcParser expects `\r\n` (which would conform to the standard) and...
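To illustrate what lenient handling of this could look like, here is a small standalone sketch that reads header fields terminated by either `\n` or `\r\n`. It is not jwarc's WarcParser and does not reflect how that parser is implemented; the class `LenientHeaderReader` is hypothetical, it assumes the version line has already been consumed, and it ignores folded continuation lines.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of jwarc.
class LenientHeaderReader {
    /** Reads one line terminated by LF or CRLF; returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') break;
            buf.write(b);
        }
        if (!sawAny) return null;
        String line = buf.toString(StandardCharsets.UTF_8); // Java 10+
        // Strip the optional CR so that CRLF and bare LF are treated alike.
        if (line.endsWith("\r")) line = line.substring(0, line.length() - 1);
        return line;
    }

    /** Reads "Name: value" fields up to the blank line that ends the header block. */
    static Map<String, String> readHeaders(InputStream in) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        String line;
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) throw new IOException("Malformed header line: " + line);
            headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return headers;
    }
}
```

After `readHeaders` returns, the stream is positioned at the first byte of the record body, so the same code path serves both conforming and ClueWeb09-style input.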

WARC writers may provide a [gzip extra field](http://zlib.org/rfc-gzip.html#extra) "sl" (recommended by [WARC 0.9](http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html#anchor25) but dropped in newer versions) to encode the length of the compressed WARC record. This can be...

idea

I am wondering whether it would be helpful to add a Dockerfile to the repo that includes Chromium/Google Chrome and other run-time requirements, so that all the tools function as expected.

help wanted

I noticed that replaying WARCs provides a 14-digit datetime placeholder. As I anticipate this will eventually be semantic, it need not necessarily be. However, providing Memento ([RFC7089](https://tools.ietf.org/html/rfc7089)) HTTP response headers...

idea

`jwarc filter resource | jwarc exec file` `jwarc filter image | jwarc exec montage`

idea