simhashphp Does this class support elements in Chinese?

Open shtse8 opened this issue 7 years ago • 22 comments

I tokenized the Chinese articles with another NLP tool and passed the tokens to the FingerPrint->hash function.

I got two fingerprints:

    0110010101101011111111100100110101011110000011101001000000000000
    0111011101101011111100100100110111011110000001101011000000000000

index = 0.21626516682989

I don't understand why the index is so low when the two fingerprints are so similar. The two articles are nearly identical.
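For reference, the bit-level difference between the two fingerprints can be checked independently of the library. The sketch below is not part of SimHashPhp; `hammingDistance` and `hammingSimilarity` are hypothetical helper names, and the fingerprints are assumed to be plain strings of `0` and `1`. By this count the two fingerprints differ in 7 of 64 positions.

```php
<?php
declare(strict_types=1);

/**
 * Count the positions at which two equal-length bit strings differ.
 * Hypothetical helper, not part of SimHashPhp.
 */
function hammingDistance(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('Fingerprints must have the same length.');
    }
    $distance = 0;
    for ($i = 0, $n = strlen($a); $i < $n; $i++) {
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }
    return $distance;
}

/** Share of matching bits, between 0 and 1. */
function hammingSimilarity(string $a, string $b): float
{
    if ($a === '') {
        throw new InvalidArgumentException('Fingerprints must not be empty.');
    }
    return (strlen($a) - hammingDistance($a, $b)) / strlen($a);
}

// The two fingerprints quoted in the post above.
$fp1 = '0110010101101011111111100100110101011110000011101001000000000000';
$fp2 = '0111011101101011111100100100110111011110000001101011000000000000';

printf(
    "distance = %d, similarity = %.4f\n",
    hammingDistance($fp1, $fp2),
    hammingSimilarity($fp1, $fp2)
);
```

A share of matching bits around 0.89 is on a very different scale from the reported index of 0.216, which suggests the index is not a plain Hamming ratio.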

Jan 30 '17 16:01 shtse8

Replace the default English tokenizer with a Chinese one and the accuracy is basically fine.

Feb 14 '17 08:02 hjy2588818

I am already using my own Chinese tokenizer, but the computed similarity looks strange. It does not seem to be based on Hamming distance.


Feb 14 '17 10:02 shtse8

@shtse8 I use GaussianComparator(30); the value I pass here is 30.

Feb 14 '17 14:02 hjy2588818

What are the characteristics of GaussianComparator? Why not use Hamming distance? From what I have seen, Hamming distance is what is normally used.

Feb 14 '17 14:02 shtse8

@shtse8

    $simhash = new \Tga\SimHash\SimHash();
    $extractor = new \Tga\SimHash\Extractor\SimpleTextExtractor();      // word segmentation
    $comparator = new \Tga\SimHash\Comparator\GaussianComparator(30);

    $fp1 = $simhash->hash($this->get_scws($text1), \Tga\SimHash\SimHash::SIMHASH_64);
    // die;
    $fp2 = $simhash->hash($this->get_scws($text2), \Tga\SimHash\SimHash::SIMHASH_64);
    
    // $fp1 = "1001010101010101000011000111010010001010010111110001000000000000";
    // $fp2 = "1001010100010101000011000111010000001000001111110011000000000000";

 
    var_dump($fp1->getBinary());
    var_dump($fp2->getBinary());

    // Index between 0 and 1 : 0.80073740291681
    $res = $comparator->compare($fp1, $fp2);

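One more point: different tokenizers produce different segmentation results, and the computed result also seems to differ slightly.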

Feb 14 '17 14:02 hjy2588818

http://www.cnblogs.com/maybe2030/p/5203186.html splits the fingerprint into blocks of 16 characters. I don't know how these are stored in MySQL to speed up the comparison.

Feb 14 '17 14:02 hjy2588818

@shtse8 For the two fingerprints you posted above, I get 0.97314496305805.

Feb 14 '17 14:02 hjy2588818

May I ask what the deviation of 30 in GaussianComparator(30) does? The default seems to be 3.
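The thread never answers what the deviation does. One reading, stated here as an assumption rather than as the library's documented behaviour, is that the comparator maps the number of differing bits d to exp(-d² / (2σ²)), where σ is the deviation. With d = 7 (the bit difference of the two fingerprints in the first post) and σ = 30, this formula gives about 0.9731, which agrees with the 0.97314496305805 reported above. `gaussianIndex` is a hypothetical name used only for this sketch.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not the library's code): map a bit-difference count
 * to an index in (0, 1] using a Gaussian curve, exp(-d^2 / (2 * sigma^2)).
 */
function gaussianIndex(int $differences, float $deviation): float
{
    if ($deviation <= 0) {
        throw new InvalidArgumentException('Deviation must be positive.');
    }
    return exp(-($differences ** 2) / (2 * $deviation ** 2));
}

// Compare how quickly the index falls off for the two deviations discussed.
foreach ([3.0, 30.0] as $deviation) {
    foreach ([0, 3, 7, 16] as $differences) {
        printf(
            "deviation=%4.1f differences=%2d index=%.4f\n",
            $deviation,
            $differences,
            gaussianIndex($differences, $deviation)
        );
    }
}
```

Under this assumption a small deviation penalises every differing bit heavily, while a large one keeps the index close to 1 even for clearly different fingerprints, so thresholds chosen for one deviation do not carry over to another.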

Feb 14 '17 14:02 shtse8

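For segmentation I use HanLP's CRF model. I am very happy with its results and have been using it all along. For SimHash, I am still deciding whether to write my own or use this tool, and I don't quite see what advantage GaussianComparator has over Hamming distance.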

Feb 14 '17 14:02 shtse8

I also noticed that \Tga\SimHash\SimHash::SIMHASH_64 uses CRC64 as the underlying conventional hash. Why not MD5? CRC does not feel evenly distributed enough.

Feb 14 '17 14:02 shtse8

GaussianComparator(30): it looks like this value can't be picked arbitrarily...

Feb 15 '17 19:02 hjy2588818

So what is the reason for 30?

Feb 15 '17 22:02 shtse8

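This is embarrassing: the default value is 3. I compared against other online simhash calculators and found that 30 gave results close to theirs. But 30 is probably wrong; that is not how it is meant to be used. Testing with real data now shows serious problems: two different articles that merely share some keywords can still score just under 0.98. I urgently need to solve this. My site has a lot of similar content that has to be cleaned out, otherwise it turns into a spam site.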

Feb 16 '17 02:02 hjy2588818

So what should this number actually be set to? And the computed index seems odd: two articles that are almost identical give me 0.20.

Feb 16 '17 08:02 shtse8

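@shtse8 I just went with the default of 3. I did not delete the data directly: pairs scoring above 0.25 go into a separate table, and I verified them with a third-party online checker ( http://life.chacuo.net/convertsimilar ). On my side, pairs above 0.45 basically turn out to be more than 90% similar. My tokenizer differs from yours and I did not look into the details. In any case this is much more efficient than before, so I simply removed everything above 0.8. Since your tokenizer is different from mine, you will still need to validate on real data to find the threshold at which content counts as duplicate.

I compute the fingerprint of each newly added document in real time, take its first 16 bits, find all stored fingerprints with the same first 16 bits, and then run the full 64-bit comparison against those.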

123456.... 12 13 14 15 16 23 24 25 26 34 35 36 45 46

This way only a small amount of data has to be compared.
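The prefix lookup described above can be sketched in memory as follows. This is an illustration under assumptions, not the commenter's actual code: `PrefixIndex` and its methods are hypothetical names, fingerprints are assumed to be 64-character strings of `0` and `1`, and a real setup would keep the prefix in an indexed database column instead of a PHP array.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical in-memory index: buckets fingerprints (bit strings) by their
 * first 16 bits, so a new document is compared in full only against the
 * fingerprints that share its prefix.
 */
final class PrefixIndex
{
    private const PREFIX_BITS = 16;

    /** @var array<int|string, array<int, string>> prefix => [docId => fingerprint] */
    private array $buckets = [];

    public function add(int $docId, string $fingerprint): void
    {
        $this->buckets[substr($fingerprint, 0, self::PREFIX_BITS)][$docId] = $fingerprint;
    }

    /** @return array<int, float> docId => share of matching bits, same-prefix documents only */
    public function candidates(string $fingerprint): array
    {
        $result = [];
        $bucket = $this->buckets[substr($fingerprint, 0, self::PREFIX_BITS)] ?? [];
        foreach ($bucket as $docId => $stored) {
            $result[$docId] = self::similarity($fingerprint, $stored);
        }
        return $result;
    }

    /** Share of matching bits over the common length of both strings. */
    private static function similarity(string $a, string $b): float
    {
        $n = min(strlen($a), strlen($b));
        $same = 0;
        for ($i = 0; $i < $n; $i++) {
            if ($a[$i] === $b[$i]) {
                $same++;
            }
        }
        return $n > 0 ? $same / $n : 0.0;
    }
}

// Made-up fingerprints for illustration only.
$base = str_repeat('01', 32);            // 64 bits
$index = new PrefixIndex();
$index->add(1, $base);                   // same prefix as the query below
$index->add(2, '1' . substr($base, 1));  // first bit flipped: different prefix

$query = substr($base, 0, 63) . '0';     // last bit flipped
print_r($index->candidates($query));     // only document 1 is compared
```

The trade-off is that a near-duplicate whose differing bits fall inside the first 16 positions is never compared. The two fingerprints in the first post, for example, already differ within their first 16 bits.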

Feb 16 '17 08:02 hjy2588818

@shtse8 It still does not feel accurate: for two different articles, the result computed from the hash fingerprints is actually 1.

Feb 18 '17 12:02 hjy2588818

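I ended up writing my own simhash instead of using this one, which behaves oddly. I still think Hamming distance combined with a conventional MD5 hash works better.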


Feb 18 '17 13:02 shtse8

@shtse8 Have you uploaded it to GitHub? Please share it.

Feb 20 '17 06:02 hjy2588818

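It is just a very simple library and I have not uploaded it to GitHub; I will simply post it here, since it is mainly for my own use. I tested it on roughly one million Chinese articles and the results look good. I treat a Hamming-based similarity (the value returned by hd() below) of 0.95 or above as the same article.

For word segmentation I use https://github.com/hankcs/HanLP , with an interface written in Java that PHP calls.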

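class Simhash
{
    // Fingerprint length in bits.
    protected static $length = 64;
    // Lookup tables that expand each hex digit into its 4-bit binary form.
    protected static $search = array('0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f');
    protected static $replace = array('0000','0001','0010','0011','0100','0101','0110','0111','1000','1001','1010','1011','1100','1101','1110','1111');

    /**
     * Build a fingerprint (a string of '0' and '1') from either a list of
     * tokens or a token => weight map.
     */
    public static function get(array &$set)
    {
        // One signed counter per fingerprint bit.
        $boxes = array_fill(0, self::$length, 0);
        // A plain list of tokens (integer keys) is converted to token => count.
        if (is_int(key($set)))
            $dict = array_count_values($set);
        else
            $dict = &$set;
        foreach ($dict as $element => $weight) {
            // MD5 hex digest -> 128-bit binary string -> first $length bits.
            // The cast covers numeric tokens, which PHP stores as integer keys.
            $hash = hash('md5', (string) $element);
            $hash = str_replace(self::$search, self::$replace, $hash);
            $hash = substr($hash, 0, self::$length);
            $hash = str_pad($hash, self::$length, '0', STR_PAD_LEFT);

            // A 1 bit adds the token's weight to its counter, a 0 bit subtracts it.
            for ($i = 0; $i < self::$length; $i++) {
                $boxes[$i] += ($hash[$i] === '1') ? $weight : -$weight;
            }
        }
        // Positive counters become 1, everything else becomes 0.
        $s = '';
        foreach ($boxes as $box) {
            if ($box > 0)
                $s .= '1';
            else
                $s .= '0';
        }

        return $s;
    }

    /**
     * Similarity derived from the Hamming distance: the share of matching
     * bits, from 0 (all bits differ) to 1 (identical fingerprints).
     */
    public static function hd($h1, $h2)
    {
        $dist = 0;
        for ($i = 0; $i < self::$length; $i++) {
            if ($h1[$i] !== $h2[$i])
                $dist++;
        }
        return (self::$length - $dist) / self::$length;
    }
}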

Feb 20 '17 21:02 shtse8

@shtse8 Impressive, thanks.

Feb 21 '17 03:02 hjy2588818

@shtse8 May I ask how you compare articles in bulk?

Jun 21 '17 05:06 regboy8

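Compute the fingerprints first and store them, then calculate similarity with Hamming distance. The longer the fingerprint the better: it records more information, so accuracy is higher. I use fingerprints of length 256. Feel free to add me on WeChat to look into this together; the ID is the same as my username.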
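A minimal sketch of that store-then-compare workflow, assuming the fingerprints are already computed and held as equal-length strings of `0` and `1` keyed by document ID. `findNearDuplicates` is a hypothetical name, and the 0.95 default is the threshold mentioned earlier in the thread. The scan is pairwise, so its cost grows quadratically with the number of documents.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: return every pair of stored fingerprints whose share
 * of matching bits reaches the threshold.
 *
 * @param array<int|string, string> $fingerprints docId => bit string of '0'/'1'
 * @return array<int, array{0: int|string, 1: int|string, 2: float}>
 */
function findNearDuplicates(array $fingerprints, float $threshold = 0.95): array
{
    $ids = array_keys($fingerprints);
    $pairs = [];
    for ($i = 0, $n = count($ids); $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            $a = $fingerprints[$ids[$i]];
            $b = $fingerprints[$ids[$j]];
            $length = strlen($a);
            if ($length === 0 || $length !== strlen($b)) {
                continue; // only equal-length fingerprints are comparable
            }
            // Byte-wise XOR of '0'/'1' strings yields "\x01" where they differ.
            $distance = substr_count($a ^ $b, "\x01");
            $similarity = ($length - $distance) / $length;
            if ($similarity >= $threshold) {
                $pairs[] = [$ids[$i], $ids[$j], $similarity];
            }
        }
    }
    return $pairs;
}

// Made-up 100-bit fingerprints for illustration only.
$stored = [
    'a' => str_repeat('1', 100),
    'b' => str_repeat('1', 96) . '0000',
    'c' => str_repeat('0', 100),
];
print_r(findNearDuplicates($stored));
```

For large collections, the pairwise scan would normally be restricted to a candidate set first, for example by bucketing on part of the fingerprint.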

Jun 22 '17 04:06 shtse8
