For Chinese text, the similarity score does not seem very accurate.
The code:
String text = "新冠疫苗效果不错";
byte[] minhash = calculateMinHash(text);
String text1 = "每天吃饭呀哈哈哈";
byte[] minhash1 = calculateMinHash(text1);
float score1 = MinHash.compare(minhash, minhash1);
The result is 0.546875.
Your README says that a result below 0.5 means the texts are not similar.
But these two texts are not similar, and the result is still greater than 0.5.
If I am using the library incorrectly, could you help me? Thanks.
The MinHash score is not an exact similarity measure. A value over 0.5 does not mean the data is the same. Please refer to the MinHash algorithm.
The README says: "compare method returns a similarity between texts. The value is from 0 to 1. But a value below 0.5 means different texts." So does a result close to 1 mean the two texts are similar? If my understanding is right, can I treat a result between 0.8 and 1 as meaning the texts are similar?
@marevol hello
Please refer to b-Bit Minwise Hashing.
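To illustrate why unrelated texts can score slightly above 0.5, here is a minimal from-scratch sketch of 1-bit minwise hashing. It does not use this library's code. It assumes a 128-bit signature (the reported 0.546875 equals 70/128, which is consistent with that, but it is only a guess), character bigrams as tokens, and a generic 64-bit mixing function. The class and method names are hypothetical. With one bit kept per hash, two unrelated texts still agree on each bit about half the time, so for sparse data the match rate is roughly 0.5 + J/2, where J is the Jaccard similarity. Rescaling with 2 * score - 1 gives a rough estimate of J.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the library's implementation.
class OneBitMinHashSketch {
    static final int NUM_BITS = 128; // assumed signature length

    // Tokenize into character bigrams (an assumption; the library may tokenize differently).
    static Set<String> bigrams(String text) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    // Generic 64-bit mixer (splitmix64 finalizer) used as a stand-in hash family.
    static long mix(long x) {
        x ^= x >>> 30;
        x *= 0xbf58476d1ce4e5b9L;
        x ^= x >>> 27;
        x *= 0x94d049bb133111ebL;
        x ^= x >>> 31;
        return x;
    }

    // For each of NUM_BITS seeded hash functions, keep only the lowest bit of the minimum hash.
    static boolean[] signature(String text) {
        Set<String> grams = bigrams(text);
        boolean[] bits = new boolean[NUM_BITS];
        for (int i = 0; i < NUM_BITS; i++) {
            long seed = mix(i + 1L);
            long min = Long.MAX_VALUE;
            for (String g : grams) {
                long h = mix(g.hashCode() ^ seed);
                if (h < min) {
                    min = h;
                }
            }
            bits[i] = (min & 1L) == 1L;
        }
        return bits;
    }

    // Fraction of signature bits that agree; about 0.5 is expected for unrelated texts.
    static float compare(boolean[] a, boolean[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                same++;
            }
        }
        return (float) same / a.length;
    }

    // Rough Jaccard estimate from a 1-bit match rate (sparse-data approximation), clamped at 0.
    static float estimateJaccard(float score) {
        return Math.max(0f, 2f * score - 1f);
    }

    public static void main(String[] args) {
        float score = compare(signature("新冠疫苗效果不错"), signature("每天吃饭呀哈哈哈"));
        System.out.println("match rate = " + score);
        System.out.println("estimated Jaccard = " + estimateJaccard(score));
    }
}
```

Under these assumptions, a raw score of 0.546875 rescales to an estimated Jaccard similarity of about 0.09, which matches the intuition that the two sentences are unrelated. With only 128 bits the match rate for unrelated texts fluctuates by a few hundredths around 0.5, so a threshold well above 0.5 is the safer choice when deciding that two texts are similar.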
OK, thanks.