
CC-News benchmark

Open MaxDall opened this issue 1 year ago • 0 comments

This PR introduces functionality to benchmark publishers using the CC-NEWS dataset.

The benchmark retrieves HTML and articles from the CC-NEWS dataset at specified intervals (daily, weekly, monthly, etc.), assesses how complete the article extraction is, and offers utility and statistical functions for working with the results. The goal is to detect layout changes that occurred before a specific parser was first implemented and to provide the relevant HTML so those changes can be addressed.
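To make the idea concrete, here is a minimal sketch of the interval sampling and completeness check described above. It is not the code in this PR: `Sample`, `sample_dates`, `completeness`, `flag_incomplete`, and the `EXPECTED_ATTRIBUTES` tuple are all hypothetical names chosen for illustration, and the extraction result is assumed to be a plain dict of attribute name to value.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Iterator, List, Optional, Sequence

# Assumption: the attributes a "complete" extraction should contain.
EXPECTED_ATTRIBUTES = ("title", "body", "authors", "publishing_date")


@dataclass
class Sample:
    """One benchmark entry: the crawl date, the raw HTML, and what was extracted."""
    crawl_date: date
    html: str
    extracted: Dict[str, Optional[object]]


def sample_dates(start: date, end: date, step: timedelta) -> Iterator[date]:
    """Yield dates from start to end (inclusive) at a fixed interval."""
    current = start
    while current <= end:
        yield current
        current += step


def completeness(
    extracted: Dict[str, Optional[object]],
    expected: Sequence[str] = EXPECTED_ATTRIBUTES,
) -> float:
    """Return the fraction of expected attributes that are present and non-empty."""
    found = sum(1 for name in expected if extracted.get(name))
    return found / len(expected)


def flag_incomplete(samples: List[Sample], threshold: float = 1.0) -> List[Sample]:
    """Return samples whose completeness is below the threshold.

    Their HTML is what one would inspect to find a layout change.
    """
    return [s for s in samples if completeness(s.extracted) < threshold]


# Weekly sampling over roughly one month.
weekly = list(sample_dates(date(2024, 1, 1), date(2024, 1, 29), timedelta(weeks=1)))

samples = [
    Sample(weekly[0], "<html>old layout</html>", {"title": "A", "body": "text"}),
    Sample(
        weekly[1],
        "<html>new layout</html>",
        {"title": "B", "body": "text", "authors": ["X"], "publishing_date": weekly[1]},
    ),
]
incomplete = flag_incomplete(samples)
```

A drop in completeness for older crawl dates would point to a layout the parser does not yet cover, and the `html` field of the flagged samples is the material needed to extend it.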

MaxDall, Aug 30 '24 14:08