datatrove
Potential issue of dedup in index
Hi, when I run the minhash dedup against an index, the cluster results produced by MinhashDedupCluster look a bit strange:
-rw-r--r-- 1 root root 108K Jul 12 12:40 001194.clusters
-rw-r--r-- 1 root root 54K Jul 12 12:40 001194.remove
-rw-r--r-- 1 root root 108K Jul 12 12:40 001195.clusters
-rw-r--r-- 1 root root 54K Jul 12 12:40 001195.remove
-rw-r--r-- 1 root root 107K Jul 12 12:40 001196.clusters
-rw-r--r-- 1 root root 54K Jul 12 12:40 001196.remove
-rw-r--r-- 1 root root 107K Jul 12 12:40 001197.clusters
-rw-r--r-- 1 root root 54K Jul 12 12:40 001197.remove
-rw-r--r-- 1 root root 106K Jul 12 12:40 001198.clusters
-rw-r--r-- 1 root root 53K Jul 12 12:40 001198.remove
-rw-r--r-- 1 root root 107K Jul 12 12:40 001199.clusters
-rw-r--r-- 1 root root 54K Jul 12 12:40 001199.remove
-rw-r--r-- 1 root root 8 Jul 12 12:40 4294967295.clusters
-rw-r--r-- 1 root root 4 Jul 12 12:40 4294967295.remove
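The last two files are outliers: their name, 4294967295, is 2^32 - 1 (the largest unsigned 32-bit value), and they are only 8 and 4 bytes long. This might be due to the SENTINEL token being treated as a document to be removed. Could there be a logic bug in the code?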
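To make the suspicion concrete, here is a minimal sketch of how a sentinel node could end up with its own `.remove` file. It is not datatrove's actual code: the `(file_id, doc_id)` node encoding, the `(SENTINEL, SENTINEL)` representation of index documents, the `plan_removals` helper, and the `guard_sentinel` flag are all assumptions made for illustration.

```python
import struct
from collections import defaultdict

SENTINEL = 2**32 - 1  # 4294967295, the largest uint32 value
SENTINEL_NODE = (SENTINEL, SENTINEL)  # assumed encoding of "a doc in the index"


def plan_removals(duplicate_pairs, guard_sentinel):
    """Return {filename: packed doc ids to remove} using a tiny union-find.

    Hypothetical helper: nodes are (file_id, doc_id) tuples.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in duplicate_pairs:
        winner, loser = find(a), find(b)
        if winner == loser:
            continue
        # With the guard on, the sentinel always stays the cluster root,
        # so the real document is the one that gets removed.
        if guard_sentinel and loser == SENTINEL_NODE:
            winner, loser = loser, winner
        parent[loser] = winner

    files = defaultdict(bytes)
    for node in sorted(parent):
        if guard_sentinel and node == SENTINEL_NODE:
            continue  # never write output for the sentinel itself
        if find(node) != node:  # non-root nodes are the duplicates to drop
            file_id, doc_id = node
            files[f"{file_id:06d}.remove"] += struct.pack("<I", doc_id)
    return dict(files)


pair = [((1194, 7), SENTINEL_NODE)]
naive = plan_removals(pair, guard_sentinel=False)
guarded = plan_removals(pair, guard_sentinel=True)
```

In this sketch the unguarded run produces a single 4-byte `4294967295.remove` entry, which would match the size of the outlier file above, while the document that duplicates the index is left untouched. With the guard, the entry lands in `001194.remove` instead and no sentinel file is written.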