datatrove
Pipeline for data contamination
Hi guys,
How can I implement a pipeline that checks for contamination between two different datasets (e.g., a pre-training dataset vs. a fine-tuning dataset) and then removes the flagged documents from the pre-training dataset?
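To make the question concrete, here is a rough sketch of the kind of check meant here, written in plain Python rather than with datatrove's API. It assumes "contamination" means sharing at least one word n-gram with the fine-tuning set; the function names (`ngrams`, `build_index`, `decontaminate`) and the default `n=8` are made up for illustration and are not taken from datatrove.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(reference_docs, n=8):
    """Collect every n-gram that appears in the reference (e.g. fine-tuning) docs."""
    index = set()
    for doc in reference_docs:
        index |= ngrams(doc, n)
    return index


def decontaminate(pretrain_docs, reference_docs, n=8):
    """Split pretrain_docs into (kept, removed).

    A document is removed if it shares at least one n-gram with the
    reference set. This threshold is an assumption; a real pipeline
    might require several overlapping n-grams instead.
    """
    index = build_index(reference_docs, n)
    kept, removed = [], []
    for doc in pretrain_docs:
        if ngrams(doc, n) & index:
            removed.append(doc)
        else:
            kept.append(doc)
    return kept, removed
```

Something like `kept, removed = decontaminate(pretrain_docs, finetune_docs)` would be the end result I am after, ideally expressed as datatrove pipeline steps so that it scales beyond what fits in memory.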
Thanks.