warcio icon indicating copy to clipboard operation
warcio copied to clipboard

Add feature to skip past corrupted records in a warc.gz file

Open lukeplausin opened this issue 5 years ago • 4 comments

Hi, I am using warcio to process some large warc.gz files. I noticed that some of these files have random corruptions part way through the file which prevents the ArchiveIterator from reading all the way through.

The change in this PR allows the ArchiveIterator to skip past random errors until it finds another valid WARC header. That way I can still read the contents of the file after the corruption. This behaviour is configurable with the skip_bad_records parameter, and the default behaviour is to reraise the exception.

lukeplausin avatar Nov 07 '19 10:11 lukeplausin

Codecov Report

Merging #99 into master will decrease coverage by 0.3%. The diff coverage is 44.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #99      +/-   ##
==========================================
- Coverage   98.49%   98.19%   -0.31%     
==========================================
  Files          18       18              
  Lines        1598     1607       +9     
  Branches      256      259       +3     
==========================================
+ Hits         1574     1578       +4     
- Misses          4        8       +4     
- Partials       20       21       +1
Impacted Files Coverage Δ
warcio/archiveiterator.py 96.45% <44.44%> (-3.55%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8e3ceb7...7776bef. Read the comment docs.

codecov[bot] avatar Nov 07 '19 10:11 codecov[bot]

I do like the overall idea of having some error recovery. This code depends on read_to_end() working correctly, so its only going to handle a limited set of possible corruptions.

As far as this pull request goes, it's a bit incomplete. Most importantly, when the new code skips a record, it doesn't inform the user in any way. It also needs tests, documentation, and a way to set this flag in the command-line tool.

wumpus avatar Nov 07 '19 17:11 wumpus

Yes, a quick test would be good to have. I think you might be able to test it with /test/data/example-wrong-chunks.warc.gz for example, if not another WARC.

But curious what are the use cases that you're thinking of, that the WARC would be partially corrupt?

More for manual use of ArchiveIterator, or for any of the command-line tools? I'd say its fine if its for manual use only, for special cases.. If so, I don't think additional notification is strictly needed, but what did you have in mind @wumpus ? Again, think it depends on what use cases this is intended for.

ikreymer avatar Nov 08 '19 00:11 ikreymer

Yes I think this would need some unit tests. I've not gotten around to doing one just yet!

lukeplausin avatar Nov 08 '19 10:11 lukeplausin