
Potential issue with minhash dedup when using an index

Open jordane95 opened this issue 7 months ago • 0 comments

Hi, when running the minhash dedup with an index, I find that the cluster results produced by MinhashDedupCluster look a bit strange:

-rw-r--r--    1 root root 108K Jul 12 12:40 001194.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001194.remove
-rw-r--r--    1 root root 108K Jul 12 12:40 001195.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001195.remove
-rw-r--r--    1 root root 107K Jul 12 12:40 001196.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001196.remove
-rw-r--r--    1 root root 107K Jul 12 12:40 001197.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001197.remove
-rw-r--r--    1 root root 106K Jul 12 12:40 001198.clusters
-rw-r--r--    1 root root  53K Jul 12 12:40 001198.remove
-rw-r--r--    1 root root 107K Jul 12 12:40 001199.clusters
-rw-r--r--    1 root root  54K Jul 12 12:40 001199.remove
-rw-r--r--    1 root root    8 Jul 12 12:40 4294967295.clusters
-rw-r--r--    1 root root    4 Jul 12 12:40 4294967295.remove

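The last pair of files, 4294967295.clusters and 4294967295.remove, is an outlier: the file name is 2^32 - 1 instead of a regular zero-padded file index, and the files are only 8 and 4 bytes long. This might be due to the SENTINEL value being treated as a document to be removed. Could there be a logic bug in the code?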
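A toy sketch of how such a file could appear. This is not datatrove's actual code. It assumes that documents are identified by (file_id, doc_id) pairs, that matches against the index are recorded with the placeholder node (SENTINEL, SENTINEL), and that one `.remove` file is written per file_id. The helper names `find`, `union`, and `removal_files` are hypothetical.

```python
from collections import defaultdict

# 2^32 - 1, the same number as in the outlier file name
SENTINEL = (1 << 32) - 1


def find(parent, node):
    """Return the root of `node` in a dict-based union-find (path halving)."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def union(parent, a, b):
    """Attach the root of `b` under the root of `a`."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def removal_files(duplicate_pairs, skip_sentinel):
    """Map each output file name to the doc ids that would be written to it."""
    parent = {}
    for a, b in duplicate_pairs:
        union(parent, a, b)

    files = defaultdict(list)
    for node in sorted(parent):
        file_id, doc_id = node
        if skip_sentinel and file_id == SENTINEL:
            continue  # the placeholder is not a real document
        if find(parent, node) != node:  # every non-root node is a duplicate
            files[f"{file_id:06d}.remove"].append(doc_id)
    return dict(files)


# Hypothetical input: doc 3 of file 1194 matches doc 7 and also the index.
pairs = [
    ((1194, 3), (1194, 7)),
    ((1194, 3), (SENTINEL, SENTINEL)),
]
unguarded = removal_files(pairs, skip_sentinel=False)
guarded = removal_files(pairs, skip_sentinel=True)
```

In this sketch, `unguarded` contains an extra `4294967295.remove` entry holding a single id, which would be 4 bytes if ids were stored as 32-bit integers (also an assumption), while `guarded` only contains `001194.remove`. Whether the real code follows this path needs to be checked against the MinhashDedupCluster source.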

jordane95, Jul 12 '24 13:07