fundus
CC-News benchmark
This PR introduces functionality to benchmark publishers using the CC-NEWS dataset.
The benchmark retrieves HTML and articles from the CC-NEWS dataset at specified intervals (daily, weekly, monthly, etc.), assesses how completely each article was extracted, and offers utility and statistical functions for working with the results. The goal is to detect layout changes that occurred before a publisher's parser was first implemented, and to provide the relevant HTML so those changes can be addressed.
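To make the completeness assessment concrete, here is a minimal sketch of how extraction completeness could be aggregated per interval. It is not the code from this PR: the attribute names in `EXPECTED_ATTRIBUTES`, the function names, and the `0.75` threshold are all assumptions chosen for illustration, and articles are modelled as plain dictionaries rather than fundus objects.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple

# Hypothetical attribute names; the actual benchmark may check different fields.
EXPECTED_ATTRIBUTES = ("title", "body", "authors", "publishing_date")

Period = Tuple[int, ...]


def completeness(article: Dict[str, object]) -> float:
    """Fraction of expected attributes extracted with a non-empty value."""
    found = sum(1 for name in EXPECTED_ATTRIBUTES if article.get(name))
    return found / len(EXPECTED_ATTRIBUTES)


def period_key(day: date, interval: str) -> Period:
    """Map a crawl date to the bucket it belongs to for the given interval."""
    if interval == "daily":
        return (day.year, day.month, day.day)
    if interval == "weekly":
        iso = day.isocalendar()
        return (iso[0], iso[1])  # ISO year and ISO week number
    if interval == "monthly":
        return (day.year, day.month)
    raise ValueError(f"unsupported interval: {interval!r}")


def completeness_by_period(
    samples: Iterable[Tuple[date, Dict[str, object]]],
    interval: str = "monthly",
) -> Dict[Period, float]:
    """Average completeness of all sampled articles in each period."""
    buckets: Dict[Period, List[float]] = defaultdict(list)
    for day, article in samples:
        buckets[period_key(day, interval)].append(completeness(article))
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


def suspected_layout_changes(
    scores: Dict[Period, float], threshold: float = 0.75
) -> List[Period]:
    """Periods whose average completeness falls below the (assumed) threshold."""
    return [key for key, score in scores.items() if score < threshold]
```

Periods flagged this way are only candidates: a low score may point to an older page layout the parser does not cover, and the HTML sampled for that period is what one would inspect to confirm it.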