datatrove

URL dedup of two datasets

Open • basma-b opened this issue 9 months ago • 1 comment

Hello,

How can I use URL dedup to deduplicate two datasets at the URL level? Basically, I want to find the documents in dataset A whose URLs do not appear in dataset B. The dedup pipeline takes a single dataset as input, so how can I pass it two datasets?
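To make the desired result concrete, here is a minimal plain-Python sketch of the operation (an anti-join on URL). It does not use datatrove's API. The `url` field name, the helper names `normalize_url` and `docs_only_in_a`, and the normalization rule are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator


def normalize_url(url: str) -> str:
    # Hypothetical normalization: trim whitespace and a trailing slash.
    # Real pipelines may normalize differently (scheme, case, query strings).
    return url.strip().rstrip("/")


def docs_only_in_a(
    dataset_a: Iterable[Dict[str, str]],
    dataset_b: Iterable[Dict[str, str]],
    url_key: str = "url",  # assumed field name, not taken from datatrove
) -> Iterator[Dict[str, str]]:
    # Collect every normalized URL of dataset B in a set for O(1) lookups.
    urls_in_b = {normalize_url(doc[url_key]) for doc in dataset_b}
    # Yield the documents of A whose URL never occurs in B.
    for doc in dataset_a:
        if normalize_url(doc[url_key]) not in urls_in_b:
            yield doc


dataset_a = [
    {"url": "https://example.org/1", "text": "first"},
    {"url": "https://example.org/2/", "text": "second"},
    {"url": "https://example.org/3", "text": "third"},
]
dataset_b = [{"url": "https://example.org/2", "text": "second, other crawl"}]

only_in_a = list(docs_only_in_a(dataset_a, dataset_b))
print([doc["url"] for doc in only_in_a])
```

This keeps all of B's URLs in memory, which is only practical for small datasets; what I am looking for is the equivalent at scale with the URL dedup blocks.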

many thanks

basma-b, May 19 '24 10:05