
Does fineweb.py perform element- and paragraph-level deduplication?

Open silverriver opened this issue 1 year ago • 0 comments

I am reading the script for reproducing FineWeb.

I noticed that the first pipeline uses Trafilatura to extract text from the WARC records:

main_processing_executor = SlurmPipelineExecutor(
    job_name=f"cc_{DUMP_TO_PROCESS}",
    pipeline=[
        WarcReader(
            f"s3://commoncrawl/crawl-data/{DUMP_TO_PROCESS}/segments/",
            glob_pattern="*/warc/*",  # we want the warc files
            default_metadata={"dump": DUMP_TO_PROCESS},
        ),
        URLFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/1_url/{DUMP_TO_PROCESS}")),
        Trafilatura(favour_precision=True),
        LanguageFilter(
...

The deduplicate flag of the Trafilatura class defaults to True:

class Trafilatura(BaseExtractor):
    """Trafilatura extractor, it uses https://trafilatura.readthedocs.io/en/latest/index.html

    We're actually only using the main entry point of trafilatura: the `extract` function.
    No specific data structure is exchanged with Trafilatura, only the text is passed and the extracted text is returned.
    Alternatively and identically, `trafilatura` could be used through its command line main interface.

    Args:
        favour_precision: prefer less text but correct extraction.
        include_images: not implemented currently
        timeout: the timeout for extraction, per document, in seconds
        deduplicate: trafilatura's deduplicate option
        **kwargs: any other option will be passed to trafilatura
    """

    name = "⛏ Trafilatura"
    _requires_dependencies = ["trafilatura"]

    def __init__(
        self,
        favour_precision: bool = True,
        include_images: bool = False,
        timeout: float = 0.1,
        deduplicate: bool = True,
        **kwargs,
    ):
        super().__init__(timeout)
        self.favour_precision = favour_precision
        self.include_images = include_images
        self.deduplicate = deduplicate
        self.kwargs = kwargs
        if self.include_images:
            raise NotImplementedError
...

See this line: https://github.com/huggingface/datatrove/blob/c7f6f516abc1349e4995451ff4017790d00d2d68/src/datatrove/pipeline/extractors/trafilatura.py#L27

Does that mean FineWeb uses Trafilatura's element- and paragraph-level deduplication feature by default?

I am also wondering how this flag affects the final dataset, i.e., what would change if I set deduplicate=False here?
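
To make the second question concrete, here is a toy sketch of what I understand paragraph-level deduplication to mean. It is not Trafilatura's actual code: extract_paragraphs and the sample page are hypothetical names I made up, and the real option may well use different matching rules or thresholds.

```python
def extract_paragraphs(paragraphs, deduplicate=True):
    """Toy stand-in for an extractor with a deduplicate switch.

    Hypothetical helper for illustration only; NOT Trafilatura's implementation.
    Keeps paragraphs in order and, when deduplicate is True, drops exact repeats
    (compared after whitespace normalization).
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        key = " ".join(paragraph.split())  # normalize whitespace before comparing
        if deduplicate and key in seen:
            continue  # repeated paragraph: skip it
        seen.add(key)
        kept.append(paragraph)
    return kept


# Hypothetical page where a boilerplate paragraph appears twice.
page = [
    "Subscribe to our newsletter for updates.",
    "The actual article text starts here.",
    "Subscribe to our newsletter for updates.",
]

with_dedup = extract_paragraphs(page, deduplicate=True)
without_dedup = extract_paragraphs(page, deduplicate=False)
print(len(with_dedup), len(without_dedup))
```

In the real pipeline, the comparison I have in mind would be replacing Trafilatura(favour_precision=True) with Trafilatura(favour_precision=True, deduplicate=False), going by the constructor signature quoted above.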

silverriver · Oct 09 '24 11:10