
For Chinese the similarity is not very accurate

Open Anandonzy opened this issue 3 years ago • 5 comments

The code:

    String text = "新冠疫苗效果不错";
    byte[] minhash = calculateMinHash(text);
    String text1 = "每天吃饭呀哈哈哈";
    byte[] minhash1 = calculateMinHash(text1);
    float score1 = MinHash.compare(minhash, minhash1);

The result is 0.546875. Your README says that a value below 0.5 means the texts are different. These two texts are not similar, but the result is greater than 0.5. If I am using it incorrectly, can you help me? Thanks.

Anandonzy avatar Apr 28 '22 12:04 Anandonzy

MinHash does not give an exact similarity. A value over 0.5 does not mean the data is the same. Please refer to the MinHash algorithm.
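To illustrate why a score slightly above 0.5 is unremarkable, here is a minimal sketch that assumes the score is simply the fraction of bit positions at which the two signatures agree. That is an assumption about how `compare` behaves, not a copy of the library code, and `BitAgreementSketch` and `bitAgreement` are hypothetical names. Under that assumption two unrelated signatures agree on about half of their bits by chance alone. The reported 0.546875 equals 70/128, which would be consistent with 70 matching bits out of 128.

```java
final class BitAgreementSketch {
    // Fraction of bit positions at which two equal-length signatures agree.
    // Assumption: this mirrors how a bitwise MinHash score is formed.
    static float bitAgreement(byte[] a, byte[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException(
                    "signatures must be non-empty and equal in length");
        }
        int differing = 0;
        for (int i = 0; i < a.length; i++) {
            // XOR leaves a 1 wherever the bytes differ; mask to 8 bits before counting.
            differing += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        int totalBits = a.length * 8;
        return (float) (totalBits - differing) / totalBits;
    }
}
```

With this measure, identical signatures score 1.0 and unrelated ones land near 0.5, so 0.5 acts as the floor rather than 0.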

marevol avatar Apr 28 '22 13:04 marevol

The README says: "compare method returns a similarity between texts. The value is from 0 to 1. But a value below 0.5 means different texts." Does a result close to 1 mean the two texts are similar? If my understanding is right, can I treat a result between 0.8 and 1 as meaning the texts are similar?

Anandonzy avatar Apr 29 '22 03:04 Anandonzy

@marevol hello

Anandonzy avatar May 09 '22 09:05 Anandonzy

Please refer to b-Bit Minwise Hashing.
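As a rough sketch of what b-bit minwise hashing implies for reading the score: when only the lowest b bits of each hash are kept, two unrelated texts still match with probability of about 1/2^b, so the raw match fraction has to be rescaled before it can be read as a resemblance estimate. The code below uses the simplified form of the estimator, which assumes the token sets are small compared with the space of possible tokens. The class and method names are hypothetical, and treating the library's score as a 1-bit match fraction is an assumption.

```java
final class BBitCorrectionSketch {
    // Simplified b-bit estimate: (matchFraction - 2^-b) / (1 - 2^-b), clamped to [0, 1].
    // Assumption: sparse data, so the chance-match rate is about 2^-b.
    static double estimateResemblance(double matchFraction, int bitsPerHash) {
        if (bitsPerHash < 1 || bitsPerHash > 30) {
            throw new IllegalArgumentException("bitsPerHash must be between 1 and 30");
        }
        double chance = 1.0 / (1 << bitsPerHash);
        double estimate = (matchFraction - chance) / (1.0 - chance);
        // Sampling noise can push the raw estimate slightly outside [0, 1].
        return Math.max(0.0, Math.min(1.0, estimate));
    }
}
```

Under these assumptions, `estimateResemblance(0.546875, 1)` gives 0.09375, so the two texts above look almost unrelated, and a raw score of 0.8 corresponds to an estimated resemblance of 0.6.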

marevol avatar May 09 '22 21:05 marevol

OK, thanks.

Anandonzy avatar May 10 '22 11:05 Anandonzy