datatrove
URL dedup of two datasets
Hello,
how can I use URL dedup to deduplicate two datasets at the URL level? Basically, I want to know which documents in dataset A are not in dataset B, based on the URL. The dedup pipeline takes a single dataset as input, so I am wondering how I can pass in two datasets.
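To make the goal concrete, this is roughly the result I am after, written as a minimal plain-Python sketch. It does not use datatrove's API; the function name `docs_only_in_a`, the `url` field on each document, and the whitespace-stripping normalization are all assumptions for illustration.

```python
from typing import Callable, Iterable


def docs_only_in_a(
    docs_a: Iterable[dict],
    docs_b: Iterable[dict],
    normalize: Callable[[str], str] = str.strip,
) -> list[dict]:
    """Return the documents of A whose URL does not appear in B.

    Hypothetical helper: each document is assumed to be a dict with a "url" key.
    """
    # Collect every (normalized) URL seen in dataset B.
    urls_b = {normalize(doc["url"]) for doc in docs_b}
    # Keep only the documents of A whose URL is absent from B.
    return [doc for doc in docs_a if normalize(doc["url"]) not in urls_b]


# Toy data, invented for this example.
dataset_a = [
    {"url": "https://example.com/1", "text": "first"},
    {"url": "https://example.com/2", "text": "second"},
]
dataset_b = [
    {"url": "https://example.com/2 ", "text": "second, from B"},
]

only_in_a = docs_only_in_a(dataset_a, dataset_b)
```

This holds all of B's URLs in memory, which would not scale to large datasets, hence my question about doing it with the URL dedup pipeline.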
many thanks