jieba: About idf.txt

The code doesn't seem to include anything that generates idf.txt from a corpus. Was it left out?

Jul 30 '13 15:07 linkerlin

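The dictionary and idf.txt were both produced by the author in advance, by training on and analyzing a corpus, so that code is not part of this project. I would also really like to see the corpus analysis code, though, and hope the author will share it!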

Aug 07 '13 02:08 lewsn2008

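If anyone is interested, I'm happy to share mine. At the time I felt the statistics in idf.txt weren't very good, so I worked out a way to generate my own copy. I also generated a set of corpus statistics using a maximum-entropy approach, mainly for discovering new words.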

Aug 12 '13 03:08 jannson

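@linkerlin, @lewsn2008, I found the script I wrote earlier to generate idf.txt. The basic idea is to segment a corpus of novels, newspapers, and periodicals, then compute the IDF statistics with the paragraph as the counting unit.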

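import jieba
import math
import sys
import re
# Matches runs of Chinese characters (defined here but not used below).
re_han = re.compile(ur"([\u4E00-\u9FA5]+)")

d={}       # word -> number of lines containing the word
total = 0  # number of lines read so far
# Python 2 script; each line of the corpus file is counted as one unit.
for line in open("yuliao_onlyseg.txt",'rb'):
    sentence = line.decode('utf-8').strip()
    # Use a set so a word counts at most once per line.
    words = set(jieba.cut(sentence))
    for w in words:
        # Keep only words that are in jieba's dictionary.
        if w in jieba.FREQ:
            d[w]=d.get(w,0.0) + 1.0
    total+=1
    if total%10000==0:
        print >>sys.stderr,'sentence count', total

# idf = -log(lines containing the word / total lines)
new_d = [(k,math.log(v/total)*-1 ) for k,v in d.iteritems()]

# Print one "word idf" pair per line.
for k,v in new_d:
    print k.encode('utf-8'),v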

Aug 13 '13 01:08 fxsjy

@fxsjy Thank you very much. The author is truly a generous expert. Much respect!

Aug 13 '13 08:08 lewsn2008

Thanks! I don't understand why it uses math.log(v/total)*-1, though.

Aug 13 '13 09:08 linkerlin

@linkerlin, math.log(total/v) works too.

Aug 13 '13 09:08 fxsjy
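
Side note on the two forms above: since -log(v/total) = log(total/v), both expressions give the same IDF value, and a word that appears in fewer paragraphs gets a larger score. The following is only a minimal Python 3 sketch of that identity. The toy paragraphs are made up and already segmented, so jieba is not needed, and the helper names (document_frequency, idf_negated, idf_ratio) are hypothetical, not part of jieba or of the script above.

```python
import math

# Toy corpus: each inner list is one already-segmented paragraph.
# These tokens are invented purely for illustration.
paragraphs = [
    ["我", "喜欢", "分词"],
    ["我", "喜欢", "小说"],
    ["分词", "工具"],
    ["我", "读", "报刊"],
]


def document_frequency(paragraphs):
    """Count, for each word, how many paragraphs contain it."""
    df = {}
    for tokens in paragraphs:
        for w in set(tokens):  # a word counts at most once per paragraph
            df[w] = df.get(w, 0) + 1
    return df


def idf_negated(df, total):
    """IDF written as in the original script: -log(v / total)."""
    return {w: -math.log(v / total) for w, v in df.items()}


def idf_ratio(df, total):
    """IDF written as suggested in the reply: log(total / v)."""
    return {w: math.log(total / v) for w, v in df.items()}


df = document_frequency(paragraphs)
total = len(paragraphs)
a = idf_negated(df, total)
b = idf_ratio(df, total)

# Print word, document frequency, and both IDF forms side by side.
for w in sorted(df):
    print(w, df[w], round(a[w], 4), round(b[w], 4))
```

The two printed IDF columns should agree for every word (up to floating-point rounding), which is why either form can be used in the script.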

Could I get the corpus data? Do you still have the yuliao_onlyseg.txt file used in the script? I'd like to run it to learn from it. Thanks.

Aug 20 '13 09:08 lewsn2008

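A new word segmentation library: https://github.com/jannson/yaha , shared purely for discussion and learning:

It offers approaches to some issues in the jieba library, such as person-name recognition, suffix recognition, and the use of regular expressions;
It also makes small optimizations to keyword extraction and ChineseAnalyzer;
It adds implementations of new-word generation with a maximum-entropy algorithm, automatic summarization, and a similarity comparison between two texts.

This library came about because, after using jieba in a small crawler and search engine of mine, I found some minor issues and optimized them. I originally wanted to modify the jieba code directly and submit the changes, but some of the design ideas differed too much, so I made a new library instead. No disrespect to the jieba author is intended.

Finally, thanks to the author of jieba: the dictionary and some of the code ideas come from jieba. I hope we can all exchange ideas more in the future.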

Aug 21 '13 08:08 jannson

@jannson, I'm now following it.

Aug 21 '13 09:08 fxsjy

@lewsn2008, the file is over 200 MB. How should I send it to you?

Aug 21 '13 09:08 fxsjy

Could you post a Dropbox link? I'd like to download a copy too.

Aug 21 '13 09:08 yanyiwu

@aszxqw, @lewsn2008, I tried out Baidu Cloud storage. Share link: http://pan.baidu.com/share/link?shareid=4094310849&uk=1124369080

Aug 22 '13 01:08 fxsjy

    if w in jieba.FREQ:
        d[w]=d.get(w,0.0) + 1.0

jieba.FREQ no longer exists. How should the "if w in ..." check be written now?

Feb 24 '18 09:02 wowxunyl

Try jieba.dt.FREQ

Feb 24 '18 10:02 alexwwang
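
Side note for scripts that have to run against both older and newer jieba versions: one possible approach is to look the table up defensively. This is only a sketch. The helper name get_freq_table is hypothetical, and the two SimpleNamespace objects are stand-ins for the module layouts mentioned in this thread (module-level FREQ versus dt.FREQ), so the snippet runs without jieba installed.

```python
from types import SimpleNamespace


def get_freq_table(jieba_module):
    """Return the word-frequency dict from either layout discussed above.

    Hypothetical helper: it only checks for the attribute names that
    appear in this thread (dt.FREQ first, then module-level FREQ).
    """
    dt = getattr(jieba_module, "dt", None)
    if dt is not None and hasattr(dt, "FREQ"):
        return dt.FREQ          # layout suggested in the reply above
    return jieba_module.FREQ    # layout used by the original script


# Stand-in objects that mimic the two layouts; they are not real jieba.
new_style = SimpleNamespace(dt=SimpleNamespace(FREQ={"分词": 10}))
old_style = SimpleNamespace(FREQ={"分词": 10})

print("分词" in get_freq_table(new_style))
print("分词" in get_freq_table(old_style))
```

With the real library one would pass the imported jieba module instead of a stand-in. As far as I know the dictionary is loaded lazily, so it is probably safest to call the helper only after jieba.initialize() or after the first jieba.cut call; treat that as an assumption to verify against your installed version.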

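The original author's code was presumably written for Python 2.x, so some things no longer exist or the syntax needs slight adjustment. I rewrote it a little, for everyone's reference.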

import jieba
import math
import sys
import re

re_han = re.compile(r"[\u4E00-\u9FA5]+")

d = {}
total = 0
# Replace the file name below with the path and file name of your corpus.
for line in open(r"corpus_path_and_filename.txt", 'rb'):
    sentence = line.decode('utf-8').strip()
    words = set(jieba.cut(sentence))
    for w in words:
        if w in jieba.dt.FREQ:
            d[w] = d.get(w, 0.0) + 1.0
    total += 1
    if total % 10000 == 0:
        # Python 3 form of the progress message written to stderr.
        print('sentence count', total, file=sys.stderr)

new_d = [(k, math.log(v / total) * -1) for k, v in d.items()]

for k, v in new_d:
    print(k, v)

Jan 06 '21 11:01 Trevor0713
