Pipeline for data contamination

Open fabiancpl opened this issue 5 months ago • 0 comments

Hi guys,

How can I implement a pipeline that checks for contamination between two datasets (e.g., a pre-training dataset vs. a fine-tuning dataset) and then removes the flagged documents from the pre-training dataset?
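To make the goal concrete, here is a rough sketch of the behaviour I have in mind, written in plain Python rather than with datatrove's own API. It assumes contamination means sharing at least one word n-gram with the fine-tuning data. The function names (`ngrams`, `build_index`, `is_contaminated`, `decontaminate`) and the default `n=8` are placeholders for illustration only, not anything taken from the library.

```python
import re
from typing import Iterable


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in `text`."""
    words = re.findall(r"\w+", text.lower())
    # Texts shorter than n words yield an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(reference_docs: Iterable[str], n: int = 8) -> set[tuple[str, ...]]:
    """Collect every n-gram from the reference (e.g. fine-tuning) documents."""
    index: set[tuple[str, ...]] = set()
    for doc in reference_docs:
        index |= ngrams(doc, n)
    return index


def is_contaminated(text: str, index: set[tuple[str, ...]], n: int = 8) -> bool:
    """Flag a document that shares at least one n-gram with the index."""
    return not ngrams(text, n).isdisjoint(index)


def decontaminate(
    pretraining_docs: Iterable[str],
    reference_docs: Iterable[str],
    n: int = 8,
) -> list[str]:
    """Return the pre-training documents that are not flagged."""
    index = build_index(reference_docs, n)
    return [doc for doc in pretraining_docs if not is_contaminated(doc, index, n)]


if __name__ == "__main__":
    finetune = ["What is the capital of France? The capital of France is Paris."]
    pretrain = [
        "Trivia time. What is the capital of France? "
        "The capital of France is Paris. Easy!",
        "A completely unrelated document about cooking pasta with tomato sauce.",
    ]
    # Only the unrelated cooking document should remain.
    print(decontaminate(pretrain, finetune))
```

What I would like is the equivalent of this as a pipeline: one step that indexes the fine-tuning dataset and one step that filters the pre-training dataset against that index.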

Thanks.

fabiancpl, Jul 07 '25 08:07