Kyle Tse comments


Does this class support elements in Chinese?

I am already using my own Chinese word segmenter, but the computed similarity still looks strange. It does not seem to be using Hamming distance.

On 14 Feb 2017 4:48 pm, "hjy2588818" wrote:
> Replacing the default English tokenizer with a Chinese one makes the accuracy basically OK.

Does this class support elements in Chinese?

What are the characteristics of GaussianComparator? Why not use Hamming distance? From what I have seen, Hamming distance is what is generally used.
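For readers unfamiliar with the comparison being asked about, here is a minimal sketch of a Hamming-distance similarity over two fingerprints written as bit strings. The function name `hammingSimilarity` and the bit-string representation are assumptions made for illustration; this is not the library's comparator API.

```php
<?php
// Hypothetical helper, not part of the library discussed in this thread.
// Both fingerprints are strings of '0'/'1' characters of equal length.
function hammingSimilarity(string $a, string $b): float
{
    $length = strlen($a);
    if ($length === 0 || $length !== strlen($b)) {
        throw new InvalidArgumentException('Fingerprints must be non-empty and of equal length.');
    }

    // Hamming distance: the number of positions where the bits differ.
    $differing = 0;
    for ($i = 0; $i < $length; $i++) {
        if ($a[$i] !== $b[$i]) {
            $differing++;
        }
    }

    // Scale to [0, 1]: 1.0 means identical, 0.0 means every bit differs.
    return 1 - $differing / $length;
}

// Two of the four bits differ, so the similarity is 0.5.
echo hammingSimilarity('1010', '1001'), PHP_EOL;
```

The score falls linearly with the number of differing bits, which makes a threshold easy to interpret.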

Does this class support elements in Chinese?

May I ask what the deviation of 30 in `GaussianComparator(30)` actually does? Its default appears to be `3`.

Does this class support elements in Chinese?

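I segment with HanLP's CRF model and am very happy with the results; I have been using it all along. I am only still deciding whether to implement SimHash myself or use this tool, and I don't quite understand what advantage `GaussianComparator` has over `Hamming Distance`.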

Does this class support elements in Chinese?

Also, I see that this tool's `\Tga\SimHash\SimHash::SIMHASH_64` uses CRC64 as the conventional hash. Why not `MD5`? CRC does not feel evenly distributed enough to me.

Does this class support elements in Chinese?

So what is the reason for 30?

Does this class support elements in Chinese?

So what should this number actually be set to? The index it computes also seems off: two nearly identical texts give me 0.20.

Does this class support elements in Chinese?

I have written my own simhash and am not using this one; this one behaves oddly. I still think Hamming distance combined with conventional md5 works better.

On 18 Feb 2017 at 8:38 pm, "hjy2588818" wrote:
> @shtse8 It still does not feel accurate. For two different pieces of content, the computed hash fingerprint actually came out as 1.

Does this class support elements in Chinese?

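It is just a very simple library. I have not uploaded it to GitHub; I will simply leave it here, since it is mainly for my own use. I tested it on roughly one million Chinese articles and the results feel good, treating a Hamming distance score of 0.95 or above as the same article. For segmentation I use https://github.com/hankcs/HanLP, with an interface written in Java for PHP to call.

```
class Simhash {
    protected static $length = 64;
    protected static $search = array('0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f');
    protected static $replace = array('0000','0001','0010','0011','0100','0101','0110','0111','1000','1001','1010','1011','1100','1101','1110','1111');
    public...
```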

Does this class support elements in Chinese?

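Compute the fingerprints first and store them, then use Hamming distance to compute similarity. The longer the fingerprint the better: it records more information, so accuracy is higher. I use fingerprints of length 256. Feel free to add me on WeChat to work on this together; my ID is the same as my username.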
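The "compute and store the fingerprint first" step can be sketched roughly as follows. This is an illustration of the general SimHash idea and not the code posted in this thread. The function name `simhashFingerprint`, its parameters, and the choice of `sha256` for a 256-bit fingerprint are all assumptions; MD5 alone yields only 128 bits.

```php
<?php
// Hypothetical sketch; not the thread author's code and not the library's API.
// $tokens: words produced by any segmenter (repeated words count more).
// $bits:   fingerprint length in bits.
// $algo:   a digest name accepted by hash(); it must yield at least $bits bits.
function simhashFingerprint(array $tokens, int $bits = 64, string $algo = 'md5'): string
{
    $weights = array_fill(0, $bits, 0);

    foreach ($tokens as $token) {
        $raw = hash($algo, $token, true); // raw binary digest
        if (strlen($raw) * 8 < $bits) {
            throw new InvalidArgumentException("$algo yields fewer than $bits bits.");
        }
        for ($i = 0; $i < $bits; $i++) {
            // Read bit $i of the digest, most significant bit first.
            $bit = (ord($raw[intdiv($i, 8)]) >> (7 - $i % 8)) & 1;
            // A set bit votes +1 for this position, a clear bit votes -1.
            $weights[$i] += $bit ? 1 : -1;
        }
    }

    // Positions with a positive vote total become '1', all others '0'.
    $fingerprint = '';
    foreach ($weights as $weight) {
        $fingerprint .= $weight > 0 ? '1' : '0';
    }
    return $fingerprint;
}

// Assumed usage: a 256-bit fingerprint, stored as a string of '0'/'1'.
$fingerprint = simhashFingerprint(['海明', '距離', '指紋'], 256, 'sha256');
echo strlen($fingerprint), PHP_EOL;
```

Stored fingerprints can then be compared bit by bit. The share of matching bits gives the similarity score.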