Add a deduplication index to repeated crawls with the same job config
We need deduplication to save storage in repeated crawls of the same job, based on a dynamically created index of the previous crawl.
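To illustrate the request, here is a minimal sketch of how such an index might be built from a previous crawl. It assumes access to the CDXJ index that ships inside a WACZ file (lines of the form `<searchable-url> <timestamp> <json-block>`, where the JSON block includes `url` and `digest` fields); the function names are hypothetical, not part of Browsertrix:

```python
import json

def load_dedup_index(cdxj_path):
    """Build a url -> payload-digest map from a previous crawl's CDXJ index.

    Assumes CDXJ lines whose JSON block carries "url" and "digest" fields,
    as found in the indexes/ directory of a WACZ file.
    """
    index = {}
    with open(cdxj_path) as f:
        for line in f:
            # The JSON block starts at the first "{" on the line.
            brace = line.find("{")
            if brace == -1:
                continue
            entry = json.loads(line[brace:])
            if "url" in entry and "digest" in entry:
                index[entry["url"]] = entry["digest"]
    return index

def is_duplicate(index, url, digest):
    """True if this response matches the previous crawl byte-for-byte."""
    return index.get(url) == digest
```

A crawler consulting such an index could then write a WARC `revisit` record (which the WARC format already defines for deduplication) instead of storing the full response payload again.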
There is already an interesting idea for doing deduplication at the filesystem level in this Webrecorder blog post.
I also like the idea of adding time-series capabilities to the WACZ format.
This is the spec: https://github.com/webrecorder/specs/blob/main/wacz-ipfs/latest/index.md
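Roughly, the filesystem-level approach in that spec relies on content addressing: files are split into chunks, and each chunk is stored under the hash of its bytes, so identical data shared across archives is stored only once. A minimal sketch of the idea, with fixed-size chunking and an in-memory dict standing in for IPFS's chunker and block store (both simplifications, not the spec's actual layout):

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # fixed-size chunks, purely for illustration

def store_file(path, block_store):
    """Split a file into chunks and store each chunk under its SHA-256
    digest. Identical chunks across files are stored only once; the
    returned digest list is enough to reassemble the original file."""
    manifest = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            block_store.setdefault(digest, chunk)  # dedup happens here
            manifest.append(digest)
    return manifest
```

Two WACZ files crawled from the same site would then share most of their blocks, while each remains independently reconstructable from its own manifest.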
One of the core product goals of Browsertrix is that all archived items generated by the crawler are self-contained and can be viewed without referencing others. This doesn't mean we'll never address de-duplication, but because we want to avoid archived items breaking, it would either have to be done at the filesystem level (see above; also incredibly non-trivial to implement, and it comes with other performance challenges), or by other means, with users understanding that they can no longer export or view the individual archived items they want de-duplicated and instead placing them into a de-duplicated bucket of content.
...But both of those possibilities are far out in the future and not something we're prepared to address at this time. Until then, however, we do have plans to accomplish some of what this issue requests in #1372. I think this solution is a decent compromise and will greatly reduce the volume of re-crawled duplicate content, especially in scheduled crawls.