
add deduplication index to a repeated crawl with same job config


We need deduplication to save storage in repeated crawls of the same job, based on a dynamically created index of the previous crawl.
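For illustration, here is a minimal sketch of what such an index could look like (all names are hypothetical, not Browsertrix's actual API): it is keyed by payload digest, so a repeated crawl can detect that a response body was already captured and record only a pointer to the original (in WARC terms, a revisit record) instead of the full payload.

```python
import hashlib
import json

# Hypothetical sketch, not Browsertrix's actual API: an index mapping payload
# digests to where the original copy of that payload was captured. A hit lets
# a re-crawl write a small "revisit" record pointing at the original capture
# instead of storing the full response body again.
class DedupIndex:
    def __init__(self, path: str):
        self.path = path
        try:
            with open(path) as f:
                self.entries = json.load(f)
        except FileNotFoundError:
            self.entries = {}  # first crawl of this job config: start empty

    @staticmethod
    def digest(payload: bytes) -> str:
        return "sha256:" + hashlib.sha256(payload).hexdigest()

    def lookup(self, payload: bytes):
        # Returns {"url": ..., "timestamp": ...} for the original capture, or None.
        return self.entries.get(self.digest(payload))

    def add(self, payload: bytes, url: str, timestamp: str) -> None:
        self.entries[self.digest(payload)] = {"url": url, "timestamp": timestamp}

    def save(self) -> None:
        with open(self.path, "w") as f:
            json.dump(self.entries, f)
```

In this sketch, the crawler would call `lookup()` on each fetched response before writing the payload, then persist the index at the end of the crawl so the next run of the same job config can load it.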

tuehlarsen avatar Mar 12 '23 08:03 tuehlarsen

There is already an interesting idea of doing deduplication at the filesystem level in this Webrecorder blog post.

I also like the idea of adding time series capability to the WACZ format.

This is the spec: https://github.com/webrecorder/specs/blob/main/wacz-ipfs/latest/index.md
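For intuition on why that approach deduplicates: in content-addressed storage, identical blocks hash to the same identifier and are stored only once, so repeated crawls of mostly unchanged content only pay for the blocks that changed. Below is a toy sketch using fixed-size blocks and an in-memory dict; the actual spec addresses blocks by IPFS CIDs and chooses split points more carefully.

```python
import hashlib

BLOCK_SIZE = 1 << 20  # 1 MiB fixed-size blocks, purely for illustration

def store_blocks(data: bytes, block_store: dict) -> list:
    """Split data into blocks, store each under its digest, return the manifest.

    Identical blocks hash to the same key, so a second crawl containing the
    same bytes adds no new entries for them -- deduplication falls out of
    content addressing for free.
    """
    manifest = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        key = hashlib.sha256(block).hexdigest()
        block_store.setdefault(key, block)  # no-op if this block already exists
        manifest.append(key)
    return manifest

# Two crawls whose content is mostly identical share most of their blocks:
store = {}
crawl_1 = store_blocks(b"A" * 3_000_000 + b"tail of crawl 1", store)
crawl_2 = store_blocks(b"A" * 3_000_000 + b"tail of crawl 2", store)
assert crawl_1[:2] == crawl_2[:2]                 # leading blocks are shared
assert len(store) < len(crawl_1) + len(crawl_2)   # fewer stored blocks than manifest entries
```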

sbaechler avatar Jun 17 '23 09:06 sbaechler

One of the core product goals of Browsertrix is that all archived items generated by the crawler are self-contained and can be viewed without referencing others. This doesn't mean we'll never address deduplication, but because we want to avoid breaking archived items, it would either have to be done at the filesystem level (see above; that approach is also incredibly non-trivial to implement and comes with other performance challenges), or by some other means that makes clear to users that they can no longer export or view the individual archived items they chose to de-duplicate, since those items would instead be placed into a shared, de-duplicated bucket of content.

...But both of those possibilities are far off in the future and not something we're prepared to address at this time. Until then, however, we do have plans to accomplish some of what this issue requests in #1372. I think this solution is a decent compromise and will greatly reduce the volume of re-crawled duplicate content, especially in scheduled crawls.

Shrinks99 avatar Jan 31 '24 17:01 Shrinks99