datatrove
Does fineweb.py perform element- and paragraph-level deduplication?
I am reading the script for reproducing FineWeb.
I noticed that the first pipeline uses Trafilatura to extract text from the WARC records:
main_processing_executor = SlurmPipelineExecutor(
    job_name=f"cc_{DUMP_TO_PROCESS}",
    pipeline=[
        WarcReader(
            f"s3://commoncrawl/crawl-data/{DUMP_TO_PROCESS}/segments/",
            glob_pattern="*/warc/*",  # we want the warc files
            default_metadata={"dump": DUMP_TO_PROCESS},
        ),
        URLFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/1_url/{DUMP_TO_PROCESS}")),
        Trafilatura(favour_precision=True),
        LanguageFilter(
        ...
and that the deduplicate flag of the Trafilatura class defaults to True:
class Trafilatura(BaseExtractor):
    """Trafilatura extractor, it uses https://trafilatura.readthedocs.io/en/latest/index.html
    We're actually only using the main entry point of trafilatura: the `extract` function.
    No specific data structure is exchanged with Trafilatura, only the text is passed and the extracted text is returned.
    Alternatively and identically, `trafilatura` could be used through its command line main interface.

    Args:
        favour_precision: prefer less text but correct extraction.
        include_images: not implemented currently
        timeout: the timeout for extraction, per document, in seconds
        deduplicate: trafilatura's deduplicate option
        **kwargs: any other option will be passed to trafilatura
    """

    name = "⛏ Trafilatura"
    _requires_dependencies = ["trafilatura"]

    def __init__(
        self,
        favour_precision: bool = True,
        include_images: bool = False,
        timeout: float = 0.1,
        deduplicate: bool = True,
        **kwargs,
    ):
        super().__init__(timeout)
        self.favour_precision = favour_precision
        self.include_images = include_images
        self.deduplicate = deduplicate
        self.kwargs = kwargs
        if self.include_images:
            raise NotImplementedError
        ...
See this line: https://github.com/huggingface/datatrove/blob/c7f6f516abc1349e4995451ff4017790d00d2d68/src/datatrove/pipeline/extractors/trafilatura.py#L27
Does that mean FineWeb uses Trafilatura's element- and paragraph-level deduplication feature by default?
I am also wondering how this flag affects the final dataset, i.e., what would change if I set deduplicate=False here?
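
To make the question concrete, here is a toy sketch of what I understand by paragraph-level deduplication. It is not Trafilatura's actual code: the function name `dedup_segments` and the thresholds `min_length` and `max_repetitions` are hypothetical and chosen only for illustration. The idea is that one counter is shared across documents, so a long paragraph that keeps reappearing (e.g. site boilerplate) is dropped once it has been seen too often.

```python
from collections import Counter


def dedup_segments(documents, min_length=100, max_repetitions=2):
    """Toy paragraph-level dedup (illustrative only, not Trafilatura's code).

    A paragraph longer than `min_length` characters is dropped once it has
    already been kept more than `max_repetitions` times in earlier text.
    Both thresholds are assumptions made for this sketch.
    """
    seen = Counter()  # shared across all documents processed in this call
    cleaned = []
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n"):
            key = paragraph.strip()
            if len(key) > min_length:
                if seen[key] > max_repetitions:
                    continue  # repeated too often: skip this paragraph
                seen[key] += 1
            kept.append(paragraph)
        cleaned.append("\n".join(kept))
    return cleaned


# Five pages sharing one long boilerplate paragraph plus a unique line each.
boilerplate = "x" * 120
pages = [f"{boilerplate}\nunique {i}" for i in range(5)]
cleaned_pages = dedup_segments(pages)
```

If Trafilatura's option behaves similarly, the extracted text of a page would depend on which pages the same worker processed before it, which is why I would like to understand the effect of the flag on the final dataset.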