warc-specifications

Add a primer on WARC deduplication

Open anjackson opened this issue 10 years ago • 3 comments

  • Explain WARC Deduplication
  • Describe the various historical deduplication approaches as well as the latest and greatest form.

anjackson avatar Jul 10 '15 20:07 anjackson

Does the dedup standard allow for deduping across multiple independent WARC files, or is it only for deduping within a single WARC?

Also, are there any considerations made for optimizing filesystem-layer deduping across multiple WARC files? (Probably not, but I'd just like to confirm it.) Is there a way to make sure uncompressed byte sequences start at rounded byte offsets within a WARC, so that block-level dedup via something like ZFS fast dedup could detect identical blocks at different locations within two different WARCs?

pirate avatar Oct 30 '24 23:10 pirate

Does the dedup standard allow for deduping across multiple independent WARC files

Yes, revisit records can refer to records in other WARC files. The common way they're used is that you run one crawl, producing an initial set of WARC files, and then you run a second crawl that produces a new set of WARC files with revisit records resolving against the original crawl. You then build an index keyed on URL and date with all of the records of both crawls together and use the index to locate the original record that a revisit refers to.
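A minimal sketch of that lookup, assuming the records have already been parsed into plain dicts. The dict layout, the `IndexEntry` type, and the function names are made up for illustration; the header names (`WARC-Refers-To-Target-URI`, `WARC-Refers-To-Date`) are the WARC 1.1 fields a revisit record can carry. A real index would more likely be a CDX file than an in-memory dict.

```python
from typing import NamedTuple, Optional


class IndexEntry(NamedTuple):
    warc_file: str    # WARC file containing the record
    offset: int       # byte offset of the record within that file
    record_type: str  # value of the WARC-Type header


def build_index(records):
    """Index the records of all crawls together, keyed on (URL, date)."""
    index = {}
    for rec in records:
        headers = rec["headers"]
        key = (headers["WARC-Target-URI"], headers["WARC-Date"])
        index[key] = IndexEntry(rec["file"], rec["offset"], headers["WARC-Type"])
    return index


def resolve_revisit(headers, index) -> Optional[IndexEntry]:
    """Return the entry a revisit record refers to, or None if not found."""
    if headers.get("WARC-Type") != "revisit":
        return None
    key = (headers.get("WARC-Refers-To-Target-URI"),
           headers.get("WARC-Refers-To-Date"))
    return index.get(key)


# Hypothetical records: a response from the first crawl and a revisit
# from the second crawl that points back at it.
records = [
    {"file": "crawl1-00000.warc.gz", "offset": 1234, "headers": {
        "WARC-Type": "response",
        "WARC-Target-URI": "http://example.com/",
        "WARC-Date": "2024-01-01T00:00:00Z"}},
    {"file": "crawl2-00000.warc.gz", "offset": 98, "headers": {
        "WARC-Type": "revisit",
        "WARC-Target-URI": "http://example.com/",
        "WARC-Date": "2024-02-01T00:00:00Z",
        "WARC-Refers-To-Target-URI": "http://example.com/",
        "WARC-Refers-To-Date": "2024-01-01T00:00:00Z"}},
]

index = build_index(records)
original = resolve_revisit(records[1]["headers"], index)
```

Here `original` locates the response record in the first crawl's WARC file, even though the revisit record itself lives in a file from the second crawl.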

Also are there any considerations made for optimizing filesystem-layer deduping across multiple WARC files

Not that I'm aware of.

Is there a way to make sure uncompressed byte sequences start at rounded byte offsets within a WARC

I haven't heard of anyone doing this, and personally I would probably prefer using revisit records, but I guess theoretically, for uncompressed WARC files, you could try adding padding in a custom header field to align the start of the payload with a block boundary.

For gzipped WARCs you could maybe try to add padding to the EXTRA or COMMENT fields in the gzip header. You'd probably also need to compress the payload as a separate gzip member from the WARC and HTTP headers, because any difference in the headers would change the rest of the compressed stream. There might be some compatibility issues with tools that assume a WARC record doesn't span gzip members, though.
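A speculative sketch of the gzip variant, using only the Python standard library. It writes the headers as one gzip member and the payload as a second member whose extra field (FEXTRA in RFC 1952) is padded so the deflate stream starts on a block boundary. The subfield ID `PD`, the function names, and the block size are assumptions made for illustration, and the compatibility caveat about records spanning gzip members still applies.

```python
import gzip
import struct
import zlib


def padded_gzip_member(data: bytes, member_offset: int,
                       block_size: int = 4096) -> bytes:
    """Build one gzip member whose deflate stream starts at a file offset
    that is a multiple of block_size (block_size must stay below ~65,000
    because the extra field length is a 16-bit value)."""
    # 10-byte fixed header + 2-byte XLEN + 4-byte subfield header.
    base = member_offset + 10 + 2 + 4
    pad_len = -base % block_size
    # "PD" is a made-up subfield ID carrying only zero bytes.
    subfield = b"PD" + struct.pack("<H", pad_len) + b"\0" * pad_len
    header = (b"\x1f\x8b\x08\x04"                 # magic, deflate, FLG.FEXTRA
              + struct.pack("<IBB", 0, 0, 255)    # MTIME=0, XFL=0, OS=unknown
              + struct.pack("<H", len(subfield)) + subfield)
    deflater = zlib.compressobj(9, zlib.DEFLATED, -15)  # raw deflate stream
    body = deflater.compress(data) + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + body + trailer


def write_record(headers: bytes, payload: bytes, record_offset: int,
                 block_size: int = 4096) -> bytes:
    """Compress headers and payload as two separate gzip members."""
    head = gzip.compress(headers, mtime=0)
    body = padded_gzip_member(payload, record_offset + len(head), block_size)
    return head + body


payload = b"the same payload bytes " * 200
rec_a = write_record(b"WARC/1.1\r\nWARC-Date: 2024-01-01T00:00:00Z\r\n\r\n",
                     payload, record_offset=123, block_size=512)
rec_b = write_record(b"WARC/1.1\r\nWARC-Date: 2024-02-02T00:00:00Z\r\n\r\n",
                     payload, record_offset=7777, block_size=512)
```

With this layout the compressed payload bytes in `rec_a` and `rec_b` are identical and block-aligned despite the differing headers and offsets, and `gzip.decompress` still returns the headers followed by the payload, since concatenated gzip members decompress to the concatenation of their contents.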

ato avatar Nov 01 '24 05:11 ato

Awesome thanks for the info!

pirate avatar Nov 01 '24 06:11 pirate