jiebaR
Using jiebaR package (SimHash algorithm)
Hello
Here are 2 texts I would like to check for near duplicates using the SimHash algorithm (jiebaR package):
library(jiebaR)
coder <- "Simhash detects near duplicates and not exact duplicates"
codel <- "SimHash is a technique for quickly detect near duplicates"
I have created a worker called "simhasher":
simhasher = worker("simhash", topn = 5)
simhasher <= codel
Then I computed the distance:
distance(codel, coder, simhasher)
Here is the result:
$distance
[1] 22
$lhs
11.7392 11.7392 11.7392 11.7392 11.7392
"duplicates" "technique" "SimHash" "detect" "quickly"
$rhs
23.4784 11.7392 11.7392 11.7392
"duplicates" "Simhash" "detects" "exact"
I need your help on 3 things:
- The distance is 22. The bigger the distance, the more different the 2 texts are. The texts here seem REALLY close, so I was expecting the distance to be smaller... Can you please explain this result?
- What are the figures above the words in lhs and rhs (e.g. 11.7392, 23.4784)?
- I also checked the worker I created:
simhasher <= codel
And here is the result I discovered:
$simhash
[1] "12382334418040220206"
$keyword
11.7392 11.7392 11.7392 11.7392 11.7392
"duplicates" "technique" "SimHash" "detect" "quickly"
What is the simhash here, and why do I need to create it before running the distance function? This part is not clear to me and not really explained in the package documentation.
Can you please help me? This package seems really powerful, but I feel like I only understand 5% of it.
@remibacha
jiebaR::distance first uses TF-IDF to extract the keywords, then uses these keywords to generate a 64-bit hash code, and finally calculates the Hamming distance between the hash codes.
Here is an example:
library(jiebaR)
#> Loading required package: jiebaRD
simhasher_5 = worker("simhash", topn = 5)
keyword_1 <- c("Simhash", "duplicates")
keyword_2 <- c("Simhash", "quickly")
simhash_1 <- vector_simhash(keyword_1, simhasher_5)
simhash_1
#> $simhash
#> [1] "144150442997195320"
#>
#> $keyword
#> 11.7392 11.7392
#> "Simhash" "duplicates"
simhash_2 <- vector_simhash(keyword_2, simhasher_5)
simhash_2
#> $simhash
#> [1] "1730138795753340968"
#>
#> $keyword
#> 11.7392 11.7392
#> "Simhash" "quickly"
tobin(simhash_1$simhash)
#> [1] "0000001000000000001000000001000001101101000100000010001000111000"
tobin(simhash_2$simhash)
#> [1] "0001100000000010101100000001000101101101000000000000000000101000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 11
vector_distance(keyword_1, keyword_2, simhasher_5)
#> $distance
#> [1] 11
#>
#> $lhs
#> 11.7392 11.7392
#> "Simhash" "duplicates"
#>
#> $rhs
#> 11.7392 11.7392
#> "Simhash" "quickly"
# only one keyword "Simhash"
simhasher_1 <- worker("simhash", topn = 1)
simhash_1 <- vector_simhash(keyword_1, simhasher_1)
simhash_1
#> $simhash
#> [1] "1883542797686548280"
#>
#> $keyword
#> 11.7392
#> "Simhash"
simhash_2 <- vector_simhash(keyword_2, simhasher_1)
simhash_2
#> $simhash
#> [1] "1883542797686548280"
#>
#> $keyword
#> 11.7392
#> "Simhash"
tobin(simhash_1$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
tobin(simhash_2$simhash)
#> [1] "0001101000100011101100000011000111101111010110100010011100111000"
# hamming-distance
simhash_dist(simhash_1$simhash, simhash_2$simhash)
#> [1] 0
vector_distance(keyword_1, keyword_2, simhasher_1)
#> $distance
#> [1] 0
#>
#> $lhs
#> 11.7392
#> "Simhash"
#>
#> $rhs
#> 11.7392
#> "Simhash"
Created on 2018-10-23 by the reprex package (v0.2.0).
hamming_distance: https://en.wikipedia.org/wiki/Hamming_distance
You can modify the user dict in jiebaRD (see ?USERPATH and ?edit_dict), which changes the TF-IDF weight of the words.
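As a minimal sketch of what pointing a worker at a custom user dictionary might look like (the file and its entries below are hypothetical; see ?worker for the `user` argument):

```r
library(jiebaR)

# Write a tiny custom user dictionary, one word per line (hypothetical entries)
user_dict <- tempfile(fileext = ".dict")
writeLines(c("Simhash", "SimHash"), user_dict)

# Pass it to the worker via the `user` argument so these words are
# recognized as single tokens during segmentation
simhasher_custom <- worker("simhash", topn = 5, user = user_dict)
simhasher_custom <= "SimHash is a technique for quickly detect near duplicates"
```
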
Thanks for this example, really helpful! But I still don't get what the figures above the words in lhs and rhs are (e.g. 11.7392). Can you please explain?
@remibacha jiebaR is designed for Chinese text segmentation; it ships with a default idf dict that only contains Chinese words. The default idf weight for an out-of-dictionary (e.g. English) word is presumably 11.7392. So, the tf-idf = tf * idf. Here is an example:
IDFPATH
#> [1] "E:/R/R-3.5-library/jiebaRD/dict/idf.utf8"
keys = worker("keywords", topn = 2)
keys <= "Simhash is quick, Simhash ia fast"
#> 23.4784 11.7392
#> "Simhash" "fast"
If you want a more accurate tf-idf weight, you need to train on your own corpus. The get_idf
function may help you. Then you can use worker("keywords", idf = "path to your idf.dict", ....)
Suppose you have a large English corpus: you can use it to train the idf, then use worker("simhash", ...)
to generate every doc's simhash value, and finally use simhash_dist_mat
to get the distances between the documents.
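The workflow above might look roughly like this. The toy corpus and the output path are made up, and the exact get_idf call is an assumption (check ?get_idf and ?simhash_dist_mat for the real signatures):

```r
library(jiebaR)

# A toy English corpus (hypothetical documents)
docs <- c("Simhash detects near duplicates and not exact duplicates",
          "SimHash is a technique for quickly detect near duplicates",
          "Something completely unrelated about the weather")

# 1. Segment the docs, then train idf weights from the token lists
#    (sketch; see ?get_idf for the actual interface)
seg    <- worker()
tokens <- lapply(docs, function(d) seg <= d)
get_idf(tokens, path = "my_idf.dict")

# 2. Build a simhash worker that uses the trained idf dict
simhasher <- worker("simhash", topn = 5, idf = "my_idf.dict")

# 3. Compute every doc's simhash value
hashes <- vapply(docs, function(d) (simhasher <= d)$simhash, character(1))

# 4. Pairwise Hamming distances between all docs; small values flag
#    near-duplicate pairs (see ?simhash_dist_mat)
simhash_dist_mat(hashes, hashes)
```
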
There is also the
stringdist
package, which can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences. The package is built for speed and runs in parallel using OpenMP. An API for C or C++ is exposed as well.
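For comparison, a quick sketch of stringdist usage on the two spellings from this thread (assumes the stringdist package is installed; method names are from ?stringdist):

```r
library(stringdist)

# Edit-based distance (default method "osa", optimal string alignment):
# "Simhash" vs "SimHash" differ by one substitution (h/H), so the distance is 1
stringdist("Simhash", "SimHash")
#> [1] 1

# Heuristic Jaro-Winkler distance, a value in [0, 1]
stringdist("Simhash", "SimHash", method = "jw")

# q-gram distance: compares the bags of length-q substrings of each string
stringdist("Simhash", "SimHash", method = "qgram", q = 2)
```
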
I think the main trick is to hash the keywords and their weights into the simhash code, and calculating the Hamming distance is pretty fast, which can be used to de-duplicate docs. For more, you can read https://github.com/yanyiwu/simhash/blob/master/README_EN.md (the author's cppjieba is the source of jiebaR). Some introductions: https://github.com/seomoz/simhash-cpp/#architecture and https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html