warcio
warcio copied to clipboard
Add feature to skip past corrupted records in a warc.gz file
Hi, I am using warcio to process some large warc.gz files. I noticed that some of these files have random corruptions part way through the file which prevents the ArchiveIterator
from reading all the way through.
The change in this PR allows the ArchiveIterator
to skip past random errors until it finds another valid WARC header. That way I can still read the contents of the file after the corruption. This behaviour is configurable with the skip_bad_records
parameter, and the default behaviour is to reraise the exception.
Codecov Report
Merging #99 into master will decrease coverage by
0.3%
. The diff coverage is44.44%
.
@@ Coverage Diff @@
## master #99 +/- ##
==========================================
- Coverage 98.49% 98.19% -0.31%
==========================================
Files 18 18
Lines 1598 1607 +9
Branches 256 259 +3
==========================================
+ Hits 1574 1578 +4
- Misses 4 8 +4
- Partials 20 21 +1
Impacted Files | Coverage Δ | |
---|---|---|
warcio/archiveiterator.py | 96.45% <44.44%> (-3.55%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 8e3ceb7...7776bef. Read the comment docs.
I do like the overall idea of having some error recovery. This code depends on read_to_end() working correctly, so its only going to handle a limited set of possible corruptions.
As far as this pull request goes, it's a bit incomplete. Most importantly, when the new code skips a record, it doesn't inform the user in any way. It also needs tests, documentation, and a way to set this flag in the command-line tool.
Yes, a quick test would be good to have. I think you might be able to test it with /test/data/example-wrong-chunks.warc.gz
for example, if not another WARC.
But curious what are the use cases that you're thinking of, that the WARC would be partially corrupt?
More for manual use of ArchiveIterator, or for any of the command-line tools? I'd say its fine if its for manual use only, for special cases.. If so, I don't think additional notification is strictly needed, but what did you have in mind @wumpus ? Again, think it depends on what use cases this is intended for.
Yes I think this would need some unit tests. I've not gotten around to doing one just yet!