jiebaR
Using jiebaR package (SimHash algorithm)
Hello
Here are 2 texts I would like to check for near duplicates using the SimHash algorithm (jiebaR package):
library(jiebaR)
coder <- "Simhash detects near duplicates and not exact duplicates"
codel <- "SimHash is a technique for quickly detect near duplicates"
I have created a worker called "simhasher":
simhasher = worker("simhash", topn = 5)
simhasher <= codel
Then I computed the distance:
distance(codel, coder, simhasher)
Here is the result:
$distance
[1] 22
$lhs
11.7392 11.7392 11.7392 11.7392 11.7392
"duplicates" "technique" "SimHash" "detect" "quickly"
$rhs
23.4784 11.7392 11.7392 11.7392
"duplicates" "Simhash" "detects" "exact"
I need your help on 3 things:
- The distance is 22. The bigger the distance, the more different the 2 texts are. The texts here seem REALLY close, so I was expecting the distance to be smaller... Can you please explain this result?
- What are the figures above the words in lhs and rhs (e.g. 11.7392, 23.4784)?
- I also checked the worker I created:
simhasher <= codel
And here is the result I discovered:
$simhash
[1] "12382334418040220206"
$keyword
11.7392 11.7392 11.7392 11.7392 11.7392
"duplicates" "technique" "SimHash" "detect" "quickly"
What is the simhash here, and why do I need to create it before running the distance function? This part is not clear to me and not really explained in the package documentation.
Can you please help me? This package seems really powerful, but I feel like I only understand 5% of it.
@remibacha
jiebaR::distance first uses TF-IDF to extract the keywords, then uses these keywords to generate a 64-bit hash code, and finally calculates the Hamming distance between the hash codes.
Here is an example:
library(jiebaR)
#> Loading required package: jiebaRD
simhasher_5 = worker("simhash", topn = 5)
keyword_1 <- c("Simhash", "duplicates")
keyword_2 <- c("Simhash", "quickly")
simhash_1 <- vector_simhash(keyword_1, simhasher_5)
simhash_1
#> $simhash
#> [1] "144150442997195320"
#>
#> $keyword
#> 11.7392 11.7392
#> "Simhash" "duplicates"
simhash_2 <- vector_simhash(keyword_2, simhasher_5)
simhash_2
#> $simhash
#> [1] "1730138795753340968"
#>
#> $keyword
#> 11.7392 11.7392
#> "Simhash" "quickly"
tobin(simhash_1$simhash)
#> [1] "0000001000000000001000000001000001101101000100000010001000111000"
tobin(simhash_2$simhash)
#> [1] "0001100000000010101100000001000101101101000000000000000000101000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 11
vector_distance(keyword_1, keyword_2, simhasher_5)
#> $distance
#> [1] 11
#>
#> $lhs
#> 11.7392 11.7392
#> "Simhash" "duplicates"
#>
#> $rhs
#> 11.7392 11.7392
#> "Simhash" "quickly"
# only one keyword "Simhash"
simhasher_1 <- worker("simhash", topn = 1)
simhash_1 <- vector_simhash(keyword_1, simhasher_1)
simhash_1
#> $simhash
#> [1] "1883542797686548280"
#>
#> $keyword
#> 11.7392
#> "Simhash"
simhash_2 <- vector_simhash(keyword_2, simhasher_1)
simhash_2
#> $simhash
#> [1] "1883542797686548280"
#>
#> $keyword
#> 11.7392
#> "Simhash"
tobin(simhash_1$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
tobin(simhash_2$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 0
vector_distance(keyword_1, keyword_2, simhasher_1)
#> $distance
#> [1] 0
#>
#> $lhs
#> 11.7392
#> "Simhash"
#>
#> $rhs
#> 11.7392
#> "Simhash"
Created on 2018-10-23 by the reprex package (v0.2.0).
hamming_distance: https://en.wikipedia.org/wiki/Hamming_distance
You can modify the user dict in jiebaRD (see ?USERPATH and ?edit_dict), which changes the TF-IDF weight of the words.
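As a minimal sketch of what pointing a worker at a custom user dictionary might look like (the file and its entries below are hypothetical; see ?worker for the `user` argument):

```r
library(jiebaR)

# Write a tiny custom user dictionary, one word per line (hypothetical entries)
user_dict <- tempfile(fileext = ".dict")
writeLines(c("Simhash", "SimHash"), user_dict)

# Pass it to the worker via the `user` argument so these words are
# recognized as single tokens during segmentation
simhasher_custom <- worker("simhash", topn = 5, user = user_dict)
simhasher_custom <= "SimHash is a technique for quickly detect near duplicates"
```
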
Thanks for this example, really helpful! But I still don't get what the figures above the words in lhs and rhs are (e.g. 11.7392). Can you please explain?
@remibacha jiebaR is designed for Chinese text segmentation; it ships with a default idf dict that only contains Chinese words. The default idf weight for an out-of-dictionary (e.g. English) word is presumably 11.7392. So, the tf-idf = tf * idf. Here is an example:
IDFPATH
#> [1] "E:/R/R-3.5-library/jiebaRD/dict/idf.utf8"
keys = worker("keywords", topn = 2)
keys <= "Simhash is quick, Simhash ia fast"
#> 23.4784 11.7392
#> "Simhash" "fast"
If you want a more accurate tf-idf weight, you need to train on your own corpus. The get_idf
function may help you. Then you can use worker("keywords", idf = "path to your idf.dict", ....)
Suppose you have a large English corpus: you can use it to train the idf, then use worker("simhash", ...)
to generate every doc's simhash value, and finally use simhash_dist_mat
to get the distances between the documents.
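The workflow above might look roughly like this. The toy corpus and the output path are made up, and the exact get_idf call is an assumption (check ?get_idf and ?simhash_dist_mat for the real signatures):

```r
library(jiebaR)

# A toy English corpus (hypothetical documents)
docs <- c("Simhash detects near duplicates and not exact duplicates",
          "SimHash is a technique for quickly detect near duplicates",
          "Something completely unrelated about the weather")

# 1. Segment the docs, then train idf weights from the token lists
#    (sketch; see ?get_idf for the actual interface)
seg    <- worker()
tokens <- lapply(docs, function(d) seg <= d)
get_idf(tokens, path = "my_idf.dict")

# 2. Build a simhash worker that uses the trained idf dict
simhasher <- worker("simhash", topn = 5, idf = "my_idf.dict")

# 3. Compute every doc's simhash value
hashes <- vapply(docs, function(d) (simhasher <= d)$simhash, character(1))

# 4. Pairwise Hamming distances between all docs; small values flag
#    near-duplicate pairs (see ?simhash_dist_mat)
simhash_dist_mat(hashes, hashes)
```
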
There is also the
stringdist
package, which can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences. The package is built for speed and runs in parallel using OpenMP. An API for C or C++ is exposed as well.
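For comparison, a quick sketch of stringdist usage on the two spellings from this thread (assumes the stringdist package is installed; method names are from ?stringdist):

```r
library(stringdist)

# Edit-based distance (default method "osa", optimal string alignment):
# "Simhash" vs "SimHash" differ by one substitution (h/H), so the distance is 1
stringdist("Simhash", "SimHash")
#> [1] 1

# Heuristic Jaro-Winkler distance, a value in [0, 1]
stringdist("Simhash", "SimHash", method = "jw")

# q-gram distance: compares the bags of length-q substrings of each string
stringdist("Simhash", "SimHash", method = "qgram", q = 2)
```
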
I think the main trick is to hash the keywords and their weights into the simhash code, and calculating the Hamming distance is pretty fast, which can be used to de-duplicate docs. For more, you can read https://github.com/yanyiwu/simhash/blob/master/README_EN.md (the author's cppjieba is the source of jiebaR). Some introductions: https://github.com/seomoz/simhash-cpp/#architecture and https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html