NOT READY: warcio test
An opinionated WARC standards-conformance tool.
Ready for review - I have yet to work on test coverage.
$ warcio test test/data/*.warc.gz test/data/*.warc
test/data/example-bad-non-chunked.warc.gz
saw exception
ERROR: non-chunked gzip file detected, gzip block continues
beyond single record.
This file is probably not a multi-member gzip but a single gzip file.
To allow seek, a gzipped WARC must have each record compressed into
a single gzip member and concatenated together.
This file is likely still valid and can be fixed by running:
warcio recompress <path/to/file> <path/to/new_file>
skipping rest of file
test/data/example-resource.warc.gz
WARC-Record-ID <urn:uuid:6e7f60da-2c7b-11e7-aaf7-0242ac120007>
WARC-Type resource
digest pass
comment: unknown field, no validation performed Warc-Referer https://webrecorder.io/temp-GRWZVUTV/temp/test/record/http://example.com/
comment: unknown field, no validation performed Warc-User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
test/data/example.warc.gz
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
WARC-Type revisit
digest present but not checked
recommendation: missing recommended header WARC-Refers-To
comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
test/data/example-wget-bad-target-uri.warc.gz
WARC-Record-ID <urn:uuid:CEF11DC9-8D86-4F4B-9B8C-2235515B4537>
WARC-Type request
digest pass
error: uri must not be within <> warc-target-uri <http://example.com/>
error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
WARC-Record-ID <urn:uuid:FD8A6D04-AF8A-4A36-A889-8094487CDF2D>
WARC-Type response
payload digest failed sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
error: uri must not be within <> warc-target-uri <http://example.com/>
error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
WARC-Record-ID <urn:uuid:E5AC383F-F107-47BC-99B7-176FD8DE6E94>
WARC-Type metadata
digest pass
error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
WARC-Record-ID <urn:uuid:543BCA4F-A305-4383-B511-0BCF23F7AD8D>
WARC-Type resource
digest pass
error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
WARC-Record-ID <urn:uuid:CCD67DB5-13FA-447B-BF05-BF1BDC8ED3A0>
WARC-Type resource
digest pass
error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
test/data/example-wrong-chunks.warc.gz
saw exception Invalid WARC record, first line: <!doctype html>
skipping rest of file
test/data/post-test.warc.gz
WARC-Record-ID <urn:uuid:59a6b068-cbc2-4767-9525-33043d2709c7>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:5eb8ee92-cda1-4503-a7a3-c63f1ab6515b>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:c79a62e3-5a4b-450d-a093-3a7fefa09664>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
test/data/example-digest-bad.warc
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
payload digest failed: sha1:1112H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
test/data/example-iana.org-chunked.warc
WARC-Record-ID <urn:uuid:c46fbf5f-0876-4652-a348-e9b6c322eabb>
WARC-Type request
digest pass
error: WARC-IP-Address should be used for http and https requests
test/data/example-trunc.warc
WARC-Record-ID <urn:uuid:a9c51e3e-0221-11e7-bf66-0242ac120005>
WARC-Type response
block digest failed: sha1:DR5MBP7OD3OPA7RFKWJUD4CTNUQUGFC5
payload digest failed sha1:G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK
WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 2560
Remainder: b'\x00\x00\r\n'
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
test/data/example.warc
WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
WARC-Type revisit
digest present but not checked
recommendation: missing recommended header WARC-Refers-To
comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
WARC-Type request
digest not present
error: WARC-IP-Address should be used for http and https requests
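For reviewers unfamiliar with the "non-chunked gzip" error at the top of the output: the difference is whether each record is its own gzip member. The following is a minimal sketch using only Python's standard library, not warcio's actual implementation. The function names (`write_per_record`, `read_member_at`, `count_members`) and the placeholder record bytes are made up for illustration and are not valid WARC records.

```python
import gzip
import io
import zlib


def write_per_record(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the combined bytes and the byte offset where each member starts.
    """
    out = io.BytesIO()
    offsets = []
    for rec in records:
        offsets.append(out.tell())
        out.write(gzip.compress(rec))
    return out.getvalue(), offsets


def read_member_at(data, offset):
    """Decompress exactly one gzip member starting at `offset`."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    return d.decompress(data[offset:])


def count_members(data):
    """Count how many gzip members are concatenated in `data`."""
    n = 0
    while data:
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        d.decompress(data)
        data = d.unused_data  # bytes left over after this member ends
        n += 1
    return n


if __name__ == "__main__":
    # Placeholder payloads standing in for real WARC records.
    records = [b"record-one\r\n\r\n", b"record-two\r\n\r\n"]

    per_record, offsets = write_per_record(records)
    single_stream = gzip.compress(b"".join(records))

    # Seekable layout: jump straight to the second record.
    print(read_member_at(per_record, offsets[1]))
    print(count_members(per_record), count_members(single_stream))
```

With the per-record layout, a reader can seek to a stored offset and decompress only that record. With a single gzip stream there is one member covering every record, so reaching a later record means decompressing everything before it.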
Codecov Report
:exclamation: No coverage uploaded for pull request base (develop@59198eb). The diff coverage is 86.44%.
@@ Coverage Diff @@
## develop #66 +/- ##
==========================================
Coverage ? 96.19%
==========================================
Files ? 19
Lines ? 2078
Branches ? 390
==========================================
Hits ? 1999
Misses ? 36
Partials ? 43
| Impacted Files | Coverage Δ | |
|---|---|---|
| warcio/archiveiterator.py | 100% <ø> (ø) | |
| warcio/tester.py | 88.96% <100%> (ø) | |
| warcio/recordloader.py | 98.69% <100%> (ø) | |
| warcio/bufferedreaders.py | 94.81% <57.89%> (ø) | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 59198eb...fc19c7d.
@N0taN3rd traditionally you've been my best reviewer :-)
@N0taN3rd has done a preliminary review; the main addition since then is some global checks.
At this point I think the code is feature-complete for the things I'm planning in the first pass, and the main work remaining is test coverage.