warcio

Streaming WARC/ARC library for fast web archive IO

Results 58 warcio issues

I am trying to use warcio to write WET files that hold text-only conversion records, but I am not able to find a way to write a record using warcio...

Here is an interesting one for you, Ilya. The original NCSA 1.5 web server responds with "HTTP 200 Document follows" rather than HTTP/1.0. In recorderloader.py HTTP_TYPES is only looking for...
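To illustrate the kind of status line described above, here is a minimal sketch of a tolerant parser that accepts both `HTTP/1.0 200 OK` and the version-less `HTTP 200 Document follows`. The function name `parse_status_line` and the regular expression are hypothetical and written only for illustration; this is not how warcio itself parses status lines.

```python
import re

# Hypothetical pattern: the "/<version>" part after "HTTP" is optional.
STATUS_LINE = re.compile(
    r"^HTTP(?:/(?P<version>\d+(?:\.\d+)?))?"  # "HTTP" or "HTTP/1.0"
    r"\s+(?P<code>\d{3})"                     # three-digit status code
    r"(?:\s+(?P<reason>.*))?$"                # optional reason phrase
)


def parse_status_line(line):
    """Return (version, code, reason), or None if the line does not match."""
    match = STATUS_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("version"), int(match.group("code")), match.group("reason") or ""


print(parse_status_line("HTTP 200 Document follows"))
print(parse_status_line("HTTP/1.0 200 OK"))
```

A missing version comes back as `None`, so a caller can decide whether to treat such a response as HTTP/0.9-style or HTTP/1.0.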

I'm doing some larger experiments with patching some WARC archives containing WordPress-based websites. WordPress supports [LaTeX](https://wordpress.com/support/latex/); unfortunately, they offer this through some endpoints that render the LaTeX code to...

Test results: https://github.com/cclauss/warcio/actions

Per WARC/1.0 spec section 5.9:

> The payload of an application/http block is its ‘entity-body’ (per [RFC2616]). The entity-body is the HTTP body *without transfer encoding* per [section 4.3 in...
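As a rough illustration of what "without transfer encoding" means for a chunked response, the sketch below strips HTTP/1.1 chunked framing from a body. The helper name `dechunk` is hypothetical, and the sketch ignores trailers and malformed input; it is not warcio's own decoder.

```python
def dechunk(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body into the plain entity-body."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(body[pos:line_end].split(b";", 1)[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; optional trailers are ignored in this sketch
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


chunked = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
print(dechunk(chunked))  # b'Wikipedia'
```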

## Overview

When attempting to use `requests.Session` with `capture_http` in some kind of loop to create new WARC files, an error is raised. However, when using `requests` directly without the...

I'm using Python 3.10.4 and warcio 1.7.4. Using a piece of code based on https://github.com/webrecorder/warcio#writing-warc-records, I'm getting ``` for record in ArchiveIterator(writer.get_stream()): AttributeError: 'WARCWriter' object has no attribute 'get_stream'. Did...

One of the most likely problems we see is failed transfers leading to truncated WARC.GZ files. We can spot this with `gunzip -t` but it would be good if `warcio...
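Until such a check exists in the tool itself, a truncation test similar in spirit to `gunzip -t` can be sketched with Python's standard library alone. The function name `gzip_is_complete` is hypothetical. It relies on `gzip.GzipFile` reading through every concatenated member, which is how a per-record-compressed WARC.GZ is laid out, and it only checks gzip integrity, not WARC record structure.

```python
import gzip
import io
import zlib


def gzip_is_complete(fileobj, chunk_size=64 * 1024):
    """Return True if every gzip member in fileobj decompresses to its end."""
    try:
        with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
            # Read and discard the data; we only care whether reading succeeds.
            while gz.read(chunk_size):
                pass
    except (EOFError, gzip.BadGzipFile, zlib.error):
        return False
    return True


# Two concatenated members stand in for a small multi-record WARC.GZ.
good = gzip.compress(b"record one\n") + gzip.compress(b"record two\n")
truncated = good[:-6]  # simulate a transfer that stopped early

print(gzip_is_complete(io.BytesIO(good)))
print(gzip_is_complete(io.BytesIO(truncated)))
```

For a file on disk, the same function can be called with `open(path, "rb")` as the file object.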

Hi, how do I deal with such an error? I'm trying to convert some really old ARCs for use in SolrWayback ``` mw@webarch:~/solrwayback/indexing/warcs1$ warcio recompress test2.arc.gz test2.warc.gz WARNING: Record not followed...