gojieba: a 2.3 GB dictionary exhausts memory on a 64 GB machine

package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"github.com/yanyiwu/gojieba"
	"io"
	"net/http"
	"runtime"
	"strings"
	"time"
)

var (
	host = flag.String("host", "127.0.0.1", "HTTP server host name")
	port = flag.Int("port", 8888, "HTTP server port")
	x    = gojieba.NewJieba("/tmp/test.dict.utf8")
)

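/*
Start the server as follows (host defaults to 127.0.0.1 and port to 8888, so both flags are optional):
go run server.go -host 0.0.0.0 -port 3306
*/
func main() {
	flag.Parse()

	// Set the number of OS threads running Go code to the number of CPUs.
	runtime.GOMAXPROCS(runtime.NumCPU())

	http.HandleFunc("/segmentation", Handler)
	addr := fmt.Sprintf("%s:%d", *host, *port)
	fmt.Println(addr)
	// ListenAndServe only returns on failure, so report the error instead of discarding it.
	if err := http.ListenAndServe(addr, nil); err != nil {
		fmt.Println(err)
	}
}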

func Handler(w http.ResponseWriter, req *http.Request) {
	start := time.Now()
	// Read the text to segment from the query string, falling back to the POST form.
	text := req.URL.Query().Get("company_name")
	if text == "" {
		text = req.PostFormValue("company_name")
	}

	// Tag returns entries of the form "word/tag"; keep only the nouns ("n").
	list := make([]string, 0)
	for _, word := range x.Tag(text) {
		// Split at the last "/" so a word that itself contains "/" cannot cause an out-of-range index.
		i := strings.LastIndex(word, "/")
		if i >= 0 && word[i+1:] == "n" {
			list = append(list, word[:i])
		}
	}
	fmt.Println("processing time:", time.Since(start).Milliseconds(), "ms")

	response, err := json.Marshal(list)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.Header().Set("Content-Type", "application/json")
	io.WriteString(w, string(response))
}

Note: /tmp/test.dict.utf8 is a single file with roughly 50 million entries. The dictionary format is as follows:

常州市伟芳机械有限公司 2 n
兰州金乐塑胶有限公司 2 n
河南兆龙电气设备有限公司 2 n
青岛德润鑫文化传媒有限公司 2 n
重庆禾加合科技发展有限公司 2 n
潍坊崔旺建材销售有限公司 2 n
甘肃龙发装饰工程有限公司 2 n
任丘市大卫电动车有限公司 2 n
建湖县众友服饰有限公司 2 n
曹县小金豆电子商务有限公司 2 n
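
Each line above has three space-separated fields: the word, a frequency, and a part-of-speech tag. As a minimal sketch of reading that format in Go, the following parses a single line. `Entry` and `parseEntry` are hypothetical names used only for illustration and are not part of gojieba; the sketch also assumes the word itself contains no whitespace.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Entry is a hypothetical type holding one dictionary line.
type Entry struct {
	Word string
	Freq int
	Tag  string
}

// parseEntry splits a "word freq tag" line into its three fields.
// Assumption: the word contains no whitespace.
func parseEntry(line string) (Entry, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return Entry{}, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	freq, err := strconv.Atoi(fields[1])
	if err != nil {
		return Entry{}, fmt.Errorf("bad frequency %q: %v", fields[1], err)
	}
	return Entry{Word: fields[0], Freq: freq, Tag: fields[2]}, nil
}

func main() {
	e, err := parseEntry("常州市伟芳机械有限公司 2 n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s freq=%d tag=%s\n", e.Word, e.Freq, e.Tag)
}
```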

Sep 23 '19 15:09 yaokun123

@yanyiwu

Sep 23 '19 15:09 yaokun123

Hmm, the usage looks fine. With a dictionary of this size, 64 GB may really not be enough...

Sep 23 '19 15:09 yanyiwu

@yanyiwu OK, thanks 🙏

Sep 24 '19 06:09 yaokun123

Hello, is there any workaround? For example, could some performance be traded for lower memory usage?

Sep 24 '19 09:09 yaokun123

My suggestion is to clean up the dictionary; it looks like the dictionary was not built in a reasonable way.

Sep 24 '19 15:09 yanyiwu

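The dictionary entries are company names, and the frequency and part-of-speech tag that follow are always the fixed "2 n". What would the optimization direction be for a dictionary like this?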

Sep 25 '19 01:09 yaokun123

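Step one: split your 50-million-entry dictionary into several batches and process them one at a time. For example, split it into 500 files, so that each batch only has to handle 100,000 lines. Step two: deduplicate the processed results.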

Alternatively, open the document as a stream, read it one line at a time, and segment each line as it is read.

Apr 22 '21 05:04 mmcer
