Calculate percentage similar

Open vbisbest opened this issue 5 years ago • 1 comments

How can I take the distance and compute a percentage of similarity? For instance, if given this example

	a := []byte("this is a test for results")
	aHash := simhash.Simhash(simhash.NewWordFeatureSet(a))

	b := []byte("this is a test for cats")
	bHash := simhash.Simhash(simhash.NewWordFeatureSet(b))

	c := simhash.Compare(aHash, bHash)
	fmt.Println(c)

I get an output of 7. But I would like to see that these are 90% similar (or whatever the exact amount is). Thank you.

vbisbest, May 26 '20 15:05

SimHash is used to compute a distance between two texts. The smaller the distance, the more similar the texts; a distance of zero means the two hashes are identical. The Compare function returns this distance. The README has a complete example:

Comparison of `this is a test phrase` and `this is a test phrass`: 2
Comparison of `this is a test phrase` and `foo bar`: 29

If you want to calculate a percentage of similarity, you need to choose a MAXIMUM distance value and use it in a formula such as:

100 - ((distance / MAXIMUM) * 100)

Let's take 1000 as the value of MAXIMUM and apply the formula to the two examples from the README:

  • 100 - ((2 / 1000) * 100) = 99.8
  • 100 - ((29 / 1000) * 100) = 97.1

So you can say that:

  • the percentage of similarity between `this is a test phrase` and `this is a test phrass` is 99.8%
  • the percentage of similarity between `this is a test phrase` and `foo bar` is 97.1%
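The formula above could be sketched in Go roughly like this. It is only an illustration: `SimilarityPercent` is a hypothetical helper, not part of simhash, and it takes the distance as a plain number so it does not depend on the exact return type of Compare. The guard and the clamp are my own additions for out-of-range input.

```go
package main

import "fmt"

// SimilarityPercent maps a distance to a percentage with the formula
// 100 - ((distance / maximum) * 100).
// Hypothetical helper for illustration; it is not part of simhash.
func SimilarityPercent(distance, maximum float64) float64 {
	if maximum <= 0 {
		return 0 // no meaningful scale to normalize against
	}
	if distance > maximum {
		distance = maximum // clamp so the result never drops below 0
	}
	return 100 - (distance/maximum)*100
}

func main() {
	// 2 and 29 are the two distances quoted from the README.
	for _, d := range []float64{2, 29} {
		fmt.Printf("distance %2.0f -> %.1f%%\n", d, SimilarityPercent(d, 1000))
	}
}
```

With MAXIMUM set to 1000 this reproduces the 99.8% and 97.1% figures from the two bullets. In real code you would pass `float64(c)`, where `c` is the value returned by Compare.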

You may ask why 1000 and not the maximum value that the Compare function could return. Compare returns a uint64, which ranges from 0 to 2^64 - 1 (18446744073709551615). Choosing 1000 was a rough normalization (adjusting the scale so that the values make sense relative to the set of values they belong to), because dividing by the full range gives percentages that are practically indistinguishable from 100%:

  • 100 - ((2 / 2^64) * 100) approx. 99.99999999999999999 %
  • 100 - ((29 / 2^64) * 100) approx. 99.9999999999999998 %

Depending on what you want to do with this percentage (e.g. display it in a report or pass it to another program), you also have to take into account the type of the variable that will store or display the value.
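To illustrate the point about variable types, here is a small Go sketch. All three function names are hypothetical and exist only for this example. It shows that applying the formula with integer arithmetic silently truncates the ratio, while converting to float64 first keeps the fraction. Rounding to a whole number is one option if the target field is an integer.

```go
package main

import (
	"fmt"
	"math"
)

// IntegerPercent applies the formula with integer arithmetic only.
// It is here to show the pitfall: distance/maximum truncates to 0
// whenever distance < maximum, so the result is 100 in that case.
// It assumes 0 < maximum and distance <= maximum.
func IntegerPercent(distance, maximum uint64) uint64 {
	return 100 - (distance/maximum)*100
}

// FloatPercent converts both operands to float64 before dividing,
// which keeps the fractional part of the ratio.
func FloatPercent(distance, maximum uint64) float64 {
	return 100 - (float64(distance)/float64(maximum))*100
}

// WholePercent rounds to the nearest whole percent, for example for a
// report column or an integer field.
func WholePercent(distance, maximum uint64) int {
	return int(math.Round(FloatPercent(distance, maximum)))
}

func main() {
	var distance, maximum uint64 = 29, 1000
	fmt.Println(IntegerPercent(distance, maximum))        // ratio truncated to 0
	fmt.Printf("%.1f\n", FloatPercent(distance, maximum)) // fraction preserved
	fmt.Println(WholePercent(distance, maximum))          // rounded for display
}
```

For a distance of 29 and a MAXIMUM of 1000, the integer version reports 100 while the float version gives roughly 97.1, so the conversion matters as soon as the distance is smaller than MAXIMUM.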

You now need to find the value of MAXIMUM that suits your use case. See https://en.wikipedia.org/wiki/Normalization_(statistics)

bbalet, May 26 '20 17:05