NOT READY: warcio test

Open wumpus opened this issue 6 years ago • 3 comments

An opinionated WARC standards-conformance tool.

Ready for review - I have yet to work on test coverage.

$ warcio test test/data/*.warc.gz test/data/*.warc
test/data/example-bad-non-chunked.warc.gz
  saw exception 
    ERROR: non-chunked gzip file detected, gzip block continues
    beyond single record.

    This file is probably not a multi-member gzip but a single gzip file.

    To allow seek, a gzipped WARC must have each record compressed into
    a single gzip member and concatenated together.

    This file is likely still valid and can be fixed by running:

    warcio recompress <path/to/file> <path/to/new_file>
  skipping rest of file
test/data/example-resource.warc.gz
  WARC-Record-ID <urn:uuid:6e7f60da-2c7b-11e7-aaf7-0242ac120007>
    WARC-Type resource
    digest pass
    comment: unknown field, no validation performed Warc-Referer https://webrecorder.io/temp-GRWZVUTV/temp/test/record/http://example.com/
    comment: unknown field, no validation performed Warc-User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
test/data/example.warc.gz
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
    WARC-Type revisit
    digest present but not checked
    recommendation: missing recommended header WARC-Refers-To
    comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
    comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
  WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
test/data/example-wget-bad-target-uri.warc.gz
  WARC-Record-ID <urn:uuid:CEF11DC9-8D86-4F4B-9B8C-2235515B4537>
    WARC-Type request
    digest pass
    error: uri must not be within <> warc-target-uri <http://example.com/>
    error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
  WARC-Record-ID <urn:uuid:FD8A6D04-AF8A-4A36-A889-8094487CDF2D>
    WARC-Type response
    payload digest failed sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
    error: uri must not be within <> warc-target-uri <http://example.com/>
    error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
  WARC-Record-ID <urn:uuid:E5AC383F-F107-47BC-99B7-176FD8DE6E94>
    WARC-Type metadata
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
  WARC-Record-ID <urn:uuid:543BCA4F-A305-4383-B511-0BCF23F7AD8D>
    WARC-Type resource
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
  WARC-Record-ID <urn:uuid:CCD67DB5-13FA-447B-BF05-BF1BDC8ED3A0>
    WARC-Type resource
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
test/data/example-wrong-chunks.warc.gz
  saw exception Invalid WARC record, first line: <!doctype html>
  skipping rest of file
test/data/post-test.warc.gz
  WARC-Record-ID <urn:uuid:59a6b068-cbc2-4767-9525-33043d2709c7>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:5eb8ee92-cda1-4503-a7a3-c63f1ab6515b>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:c79a62e3-5a4b-450d-a093-3a7fefa09664>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-digest-bad.warc
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    payload digest failed: sha1:1112H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-iana.org-chunked.warc
  WARC-Record-ID <urn:uuid:c46fbf5f-0876-4652-a348-e9b6c322eabb>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-trunc.warc
  WARC-Record-ID <urn:uuid:a9c51e3e-0221-11e7-bf66-0242ac120005>
    WARC-Type response
    block digest failed: sha1:DR5MBP7OD3OPA7RFKWJUD4CTNUQUGFC5
    payload digest failed sha1:G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 2560
    Remainder: b'\x00\x00\r\n'
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
test/data/example.warc
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
    WARC-Type revisit
    digest present but not checked
    recommendation: missing recommended header WARC-Refers-To
    comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
    comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
  WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
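
The non-chunked gzip message in the output above says each record must be its own gzip member. As a rough illustration of that layout (standard library only; the function names and the placeholder record bytes are made up for this sketch and are not part of warcio), the following compresses each record separately and then recovers the member offsets that make seeking possible:

```python
import gzip
import zlib


def write_per_record_gzip(records):
    """Compress every record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


def gzip_member_offsets(data):
    """Return the byte offset at which each gzip member starts."""
    offsets = []
    pos = 0
    while pos < len(data):
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        decomp.decompress(data[pos:])
        if not decomp.eof:
            raise ValueError("truncated gzip member at offset %d" % pos)
        offsets.append(pos)
        # unused_data holds whatever follows the member just decoded.
        pos = len(data) - len(decomp.unused_data)
    return offsets


# Placeholder payloads standing in for serialized WARC records.
records = [b"first placeholder record\r\n\r\n", b"second placeholder record\r\n\r\n"]

per_record = write_per_record_gzip(records)   # one member per record
single = gzip.compress(b"".join(records))     # everything in one member

per_record_offsets = gzip_member_offsets(per_record)
single_offsets = gzip_member_offsets(single)

# A reader can jump straight to the second record in the per-record file.
second_record = gzip.decompress(per_record[per_record_offsets[1]:])
```

With the per-record layout there is one offset per record, so an index can point directly at any record. The single-member file has only offset 0, which is the situation the error message describes.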

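The digest values in the output above have the form `sha1:` followed by 32 base32 characters. A minimal sketch of how such a value can be computed and compared, assuming the digest is a plain SHA-1 over the bytes in question (the helper names are hypothetical, and which bytes count as the block or payload is decided by the WARC tooling, not by this sketch):

```python
import base64
import hashlib


def sha1_base32_digest(data):
    """Format a SHA-1 digest as 'sha1:' plus the base32-encoded hash."""
    raw = hashlib.sha1(data).digest()  # 20 bytes -> 32 base32 chars, no padding
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def digest_matches(header_value, data):
    """Compare a digest header value against the digest of the given bytes."""
    return header_value == sha1_base32_digest(data)


# Hypothetical payload bytes, for illustration only.
payload = b"hello, warc"
expected = sha1_base32_digest(payload)

ok = digest_matches(expected, payload)
tampered = digest_matches(expected, payload + b"!")
```
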
wumpus avatar Jan 26 '19 00:01 wumpus

Codecov Report

No coverage uploaded for pull request base (develop@59198eb). The diff coverage is 86.44%.

@@            Coverage Diff             @@
##             develop      #66   +/-   ##
==========================================
  Coverage           ?   96.19%           
==========================================
  Files              ?       19           
  Lines              ?     2078           
  Branches           ?      390           
==========================================
  Hits               ?     1999           
  Misses             ?       36           
  Partials           ?       43

Impacted Files               Coverage Δ
warcio/archiveiterator.py    100% <ø> (ø)
warcio/tester.py             88.96% <100%> (ø)
warcio/recordloader.py       98.69% <100%> (ø)
warcio/bufferedreaders.py    94.81% <57.89%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 59198eb...fc19c7d.

codecov[bot] avatar Jan 26 '19 01:01 codecov[bot]

@N0taN3rd traditionally you've been my best reviewer :-)

wumpus avatar Jan 26 '19 02:01 wumpus

@N0taN3rd has done a preliminary review; the main addition since then is a set of global checks.

At this point I think the code is feature-complete for what I'm planning in the first pass, and the main work remaining is test coverage.

wumpus avatar Jan 30 '19 01:01 wumpus