friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular in implementation and can be easily embedded in other...
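What is Friso?
Friso is an open-source, high-performance Chinese tokenizer written in C and built on the popular MMSEG algorithm. Its design and implementation are fully modular, so it can be easily embedded in other programs such as MySQL and PHP, and plugin implementations are provided for php5, php7, ocaml, and lua. The source compiles on a variety of platforms without modification. With a lexicon of 200,000 entries loaded, memory usage is stable at 14.5 MB.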
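Friso core features:
- [x] Chinese word segmentation: the MMSEG algorithm plus Friso's own optimizations, with four segmentation modes.
- [ ] Keyword extraction: based on the textRank algorithm.
- [ ] Key phrase extraction: based on the textRank algorithm.
- [ ] Key sentence extraction: based on the textRank algorithm.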
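Friso Chinese word segmentation:
Four segmentation modes:
- [x] Simple mode: the FMM (forward maximum matching) algorithm, suited to speed-critical use.
- [x] Complex mode: the four MMSEG filtering rules, with strong ambiguity resolution; segmentation accuracy reaches 98.41%.
- [x] Detect mode: returns only the words that already exist in the lexicon, which suits certain applications (available since version 1.6.1).
- [ ] Most mode: fine-grained segmentation built for retrieval. Apart from Chinese handling (it has no smart features such as Chinese name or number recognition), it behaves the same as complex mode (English, compound words, and so on).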
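To make the simple mode concrete, here is a minimal sketch of forward maximum matching over a toy lexicon. It is not Friso's implementation: the lexicon contents and the names `TOY_LEXICON`, `utf8_len`, `longest_match`, and `fmm_segment` are hypothetical, and the text is assumed to be UTF-8.

```c
#include <stdio.h>
#include <string.h>

/* Toy lexicon (hypothetical); a real one would be loaded from the dict directory. */
static const char *TOY_LEXICON[] = { "研究", "研究生", "生命", "起源" };
static const size_t TOY_LEXICON_SIZE = sizeof TOY_LEXICON / sizeof TOY_LEXICON[0];

/* Byte length of the UTF-8 character whose lead byte is c. */
static size_t utf8_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1; /* invalid lead byte: step over a single byte */
}

/* Length in bytes of the longest lexicon word that is a prefix of text, or 0. */
static size_t longest_match(const char *text) {
    size_t best = 0;
    for (size_t i = 0; i < TOY_LEXICON_SIZE; i++) {
        size_t n = strlen(TOY_LEXICON[i]);
        if (n > best && strncmp(text, TOY_LEXICON[i], n) == 0) best = n;
    }
    return best;
}

/* Forward maximum matching: writes space-separated tokens into out. */
void fmm_segment(const char *text, char *out, size_t out_size) {
    size_t used = 0;
    if (out_size == 0) return;
    out[0] = '\0';
    while (*text != '\0') {
        size_t remaining = strlen(text);
        size_t n = longest_match(text);
        if (n == 0) n = utf8_len((unsigned char)*text); /* unknown: emit one character */
        if (n > remaining) n = remaining;               /* truncated UTF-8 sequence */
        if (used + n + 2 > out_size) break;             /* separator + token + NUL */
        if (used > 0) out[used++] = ' ';
        memcpy(out + used, text, n);
        used += n;
        out[used] = '\0';
        text += n;
    }
}
```

With this toy lexicon, the greedy pass splits 研究生命起源 into 研究生 / 命 / 起源, because it always takes the longest word at the current position. That is the kind of ambiguity the complex mode's filtering rules are meant to resolve.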
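Segmentation features:
- [x] Segmentation of both UTF-8 and GBK encoded text, with php5 and php7 extensions and a sphinx token plugin.
- [x] Custom lexicons. In the dict folder you can freely add, delete, or modify lexicon files and lexicon entries; the lexicons are organized by category.
- [x] Simplified, Traditional, and mixed Simplified/Traditional Chinese support: text in any of these forms can be segmented easily, which also enables cross-retrieval between Simplified and Traditional Chinese.
- [x] Recognition of Chinese-English and English-Chinese mixed words (any combination can be recognized by maintaining the lexicon). For example: 卡拉ok, 漂亮mm, c语言, IC卡, 哆啦a梦.
- [x] Good English support, including recognition of words combined with English punctuation, e.g. c++, c#, email addresses, URLs, decimals, and percentages.
- [x] Custom kept punctuation: you can define which punctuation marks are kept in the segmentation result, so that complex combinations such as c++, k&r, and code.google.com can be recognized.
- [x] Secondary segmentation of complex English tokens: by default Friso keeps the original combination of digits and letters; with this feature enabled, the token is segmented a second time to improve the retrieval hit rate. For example, qq2013 is segmented into: qq/ 2013/ qq2013.
- [x] Recognition of Arabic numerals or decimals followed by a basic single-character unit, e.g. 2012年, 1.75米, 5吨, 120斤, 38.6℃.
- [x] Automatic full-width/half-width and uppercase/lowercase conversion for English.
- [x] Synonym matching: Chinese and English synonyms are appended automatically (requires the friso.add_syn option to be enabled in friso.ini).
- [x] Automatic Chinese and English stopword filtering (requires the friso.clr_stw option to be enabled in friso.ini).
- [x] Multiple-configuration support; safe to use in multi-process and multi-threaded environments.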
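The secondary segmentation described above (qq2013 becoming qq/ 2013/ qq2013) can be sketched as follows. This is an illustration under assumptions, not Friso's code: the function name `secondary_segment` is hypothetical, the token is assumed to be ASCII, and the `min_len` parameter only loosely mirrors the `friso.st_minl` setting.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Splits token into maximal digit runs and non-digit runs, keeps the runs
 * that are at least min_len characters long, then appends the whole token.
 * Parts are joined with "/ " to mirror the qq2013 example.
 */
void secondary_segment(const char *token, size_t min_len,
                       char *out, size_t out_size) {
    size_t used = 0;
    size_t i = 0;
    size_t len = strlen(token);
    if (out_size == 0) return;
    out[0] = '\0';
    while (i < len) {
        size_t start = i;
        int is_digit = isdigit((unsigned char)token[i]) != 0;
        /* Advance to the end of the current run. */
        while (i < len && (isdigit((unsigned char)token[i]) != 0) == is_digit) i++;
        size_t run = i - start;
        /* Skip short runs, and skip a run that is the whole token. */
        if (run >= min_len && run < len) {
            int written = snprintf(out + used, out_size - used, "%.*s/ ",
                                   (int)run, token + start);
            if (written < 0 || (size_t)written >= out_size - used) return;
            used += (size_t)written;
        }
    }
    /* The original combination is always kept as the last part. */
    snprintf(out + used, out_size - used, "%s", token);
}
```

Calling `secondary_segment("qq2013", 2, buf, sizeof buf)` fills `buf` with `qq/ 2013/ qq2013`; a token such as `a1` stays whole because both runs are shorter than the minimum length.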
Friso quick start:
Terminal test:
- cd to the Friso root directory.
- make
- Run: ./src/friso -init ./friso.ini
- You will see a terminal interface similar to the following
- Type text at the cursor to start testing
Initialized in 0.088911sec
Mode: Complex
+-Version: 1.6.2 (UTF-8)
+---------------------------------------------------------------+
| Friso - a Chinese word segmentation writen by c. |
| bug report email - [email protected]. |
| or: visit https://github.com/lionsoul2014/friso. |
| java edition for https://github.com/lionsoul2014/jcseg |
| type 'quit' to exit the program. |
+---------------------------------------------------------------+
friso>>
Test sample:
Input text
歧义和同义词:研究生命起源，混合词: 做B超检查身体，x射线本质是什么，今天去奇都ktv唱卡拉ok去，哆啦a梦是一个动漫中的主角，单位和全角: 2009年８月６日开始大学之旅，岳阳今天的气温为38.6℃, 也就是101.48℉, 英文数字: bug report [email protected] or visit http://code.google.com/p/jcseg, we all admire the hacker spirit!特殊数字: ① ⑩ ⑽ ㈩.
Segmentation result:
歧义 和 同义词 : 研究 琢磨 研讨 钻研 生命 起源 ， 混合词 : 做 b超 检查 身体 ， x射线 本质 是 什么 ， 今天 去 奇都ktv 唱 卡拉ok 去 ， 哆啦a梦 是 一个 动漫 中 的 主角 ， 单位 和 全角 : 2009年 8月 6日 开始 大学 之旅 ， 岳阳 今天 的 气温 为 38.6℃ , 也就是 101.48℉ , 英文 英语 数字 : bug report example gmail com [email protected] or visit http : / / code google com code.google.com / p / jcseg , we all admire appreciate like love enjoy the hacker spirit mind ! 特殊 数字 : .
Friso installation
Linux:
cd to the Friso root directory and run:
make
sudo make install
# for testing
make testing
Note: on a 64-bit system, also copy /usr/lib/libfriso.so to /usr/lib64.
Winnt:
- Build with VS to produce the dll and lib files; for details, see the Friso discussion: http://www.oschina.net/question/853816_135216
- Build from source with cygwin: delete the original Makefile, rename Makefile.cygwin to Makefile, open a cygwin terminal, cd to Friso's src directory, and run:
make
Note: friso.exe and friso.dll are then available in Friso's src directory.
Friso configuration
Configuring Friso is simple: locate the friso.ini configuration file and open it in a text editor.
Configuration reference:
# friso configuration file.
# do not change the name of the left key.
# @email [email protected]
# @date 2012-12-20
#
# charset: only UTF8 and GBK are supported.
# set it to UTF8(0) or GBK(1)
friso.charset = 0
# lexicon directory absolute path.
# the value must end with '/'
# this will tell friso how to find friso.lex.ini configuration file and all the lexicon files.
#
# if the value does not start with '/' on linux, or contains no ':' on winnt,
# friso will search for friso.lex.ini relative to friso.ini
# absolute path search:
# linux: friso.lex_dir = /c/products/friso/dict/UTF-8/
# Winnt: friso.lex_dir = D:/products/friso/dict/UTF-8/
# relative path search (All system)
friso.lex_dir = ./vendors/dict/UTF-8/
# the maximum matching length.
friso.max_len = 5
# 1 to enable chinese name recognition,
# and 0 to disable it.
friso.r_name = 1
# the maximum length for the cjk words in a
# chinese and english mixed word.
friso.mix_len = 2
# the maximum length of the chinese last name adorn word.
friso.lna_len = 1
# append the synonyms words
friso.add_syn = 1
# clear the stopwords or not (1 to open it and 0 to close it)
# @date 2013-06-13
friso.clr_stw = 0
# keep the unrecognized words or not (1 to open it and 0 to close it)
# @date 2013-06-13
friso.keep_urec = 0
# use sphinx output style like 'admire|love|enjoy einsten'
# @date 2013-10-25
friso.spx_out = 0
# start the secondary segmentation for complex english token.
friso.en_sseg = 1
# min length of the secondary segmentation token. (better larger than 1)
friso.st_minl = 2
# default keep punctuations for english token.
friso.kpuncs = @%.#&+
# the threshold value for a char not a part of a chinese name.
friso.nthreshold = 2000000
# default mode for friso.
# 1 : simple mode - simple maximum matching algorithm.
# 2 : complex mode - four rules of the mmseg algorithm.
# 3 : detect mode - only return the words that exist in the lexicon
friso.mode = 2
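The `friso.lex_dir` comments describe how a value is treated as absolute or as relative to friso.ini. A minimal sketch of that rule follows, assuming both platform checks are merged into one test; the function names `lex_dir_is_absolute` and `resolve_lex_dir` and the example paths are hypothetical, and Friso's own lookup code may differ.

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 when lex_dir looks absolute under the rule in the config
 * comments: it starts with '/' (linux) or contains ':' (winnt).
 */
int lex_dir_is_absolute(const char *lex_dir) {
    return lex_dir[0] == '/' || strchr(lex_dir, ':') != NULL;
}

/*
 * Writes the directory to search into out: lex_dir itself when absolute,
 * otherwise the directory that holds ini_path followed by lex_dir.
 * Returns 0 on success, -1 if out is too small.
 */
int resolve_lex_dir(const char *ini_path, const char *lex_dir,
                    char *out, size_t out_size) {
    int written;
    if (lex_dir_is_absolute(lex_dir)) {
        written = snprintf(out, out_size, "%s", lex_dir);
    } else {
        const char *slash = strrchr(ini_path, '/');
        /* Keep everything up to and including the last '/'. */
        int dir_len = slash != NULL ? (int)(slash - ini_path + 1) : 0;
        written = snprintf(out, out_size, "%.*s%s", dir_len, ini_path, lex_dir);
    }
    return (written >= 0 && (size_t)written < out_size) ? 0 : -1;
}
```

For a friso.ini at the hypothetical path `/etc/friso/friso.ini`, the default value `./vendors/dict/UTF-8/` resolves to `/etc/friso/./vendors/dict/UTF-8/`, while a value such as `D:/products/friso/dict/UTF-8/` is used unchanged.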
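Lexicon configuration:
- friso.lex_dir in friso.ini points to the lexicon directory that Friso depends on. Set it to the absolute path of the lexicon directory; the value must end with "/". For example: friso.lex_dir = /usr/lib/friso/dict/
- Lexicons are provided in UTF-8 and GBK encodings; load the lexicon that matches the encoding you use.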
Friso plugins
Friso currently provides segmentation plugins for php5, php7, ocaml, and lua:
Language | binding | Author | Status |
---|---|---|---|
php | php5-binding | dongyado<[email protected]> | Done |
php | php7-binding | dongyado<[email protected]> | Done |
ocaml | ocaml-binding | https://github.com/kandu | Done |
sphinx | sphinx-binding | lionsoul<[email protected]> | In development |
lua | lua-binding | lionsoul<[email protected]> | In development |
Friso segmentation API
A complete demo:
/* Step 1: declare the objects */
friso_t friso;           /* Friso segmentation object */
friso_config_t config;   /* Friso configuration object */
friso_task_t task;       /* Friso task object */

/* Step 2: initialize the objects */
friso = friso_new();
config = friso_new_config();
task = friso_new_task();

/* Initialize friso from the friso.ini configuration file */
if (friso_init_from_ifile(friso, config, "path to friso.ini") != 1) {
    /* friso initialization failed */
}

/*
 * The segmentation mode defaults to the setting in friso.ini.
 * Use friso_set_mode to choose a mode explicitly (simple, complex, or detect):
 * Simple mode:  __FRISO_SIMPLE_MODE__
 * Complex mode: __FRISO_COMPLEX_MODE__
 * Detect mode:  __FRISO_DETECT_MODE__
 * For example, to segment in complex mode:
 */
friso_set_mode(config, __FRISO_COMPLEX_MODE__);

/* Step 3: set the text to segment */
friso_set_text(task, "text to segment");

/* Step 4: fetch the segmentation results */
while (config->next_token(friso, config, task) != NULL) {
    /*
     * task holds the segmentation result:
     * task->token->word:   the token text
     * task->token->offset: offset of the token in the original text
     * task->token->length: length of the token (in bytes)
     * task->token->rlen:   real byte count of the token (length in bytes after Friso's conversion)
     */
    printf("%s ", task->token->word);
}

/* Step 5: free the objects */
friso_free_task(task);
friso_free_config(config);
friso_free(friso);
Notes:
- Steps 3 and 4 can be repeated; call friso_set_text to reset the text to segment.
- In a multi-threaded environment, threads share the friso and config objects, but each thread must initialize and use its own task object.
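Because the token length is reported in bytes, a caller working with UTF-8 text may want a character count instead. The helper below is a small sketch of that conversion; `utf8_char_count` is a hypothetical name, it is not part of the Friso API, and it does not apply to GBK text.

```c
#include <stddef.h>

/*
 * Counts the UTF-8 characters in the first byte_len bytes of s by
 * counting every byte that is not a continuation byte (10xxxxxx).
 */
size_t utf8_char_count(const char *s, size_t byte_len) {
    size_t count = 0;
    for (size_t i = 0; i < byte_len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) count++;
    }
    return count;
}
```

For the mixed word 卡拉ok, which occupies 8 bytes in UTF-8 (two 3-byte characters plus two ASCII letters), the helper returns 4.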
Friso lexicon management
Lexicon category definitions
Friso classifies its lexicons internally, so you need to know the lexicon categories before managing them. The integer value and meaning of each category are as follows:
typedef enum {
    __LEX_CJK_WORDS__ = 0,      // ordinary CJK lexicon
    __LEX_CJK_UNITS__ = 1,      // CJK unit lexicon
    __LEX_ECM_WORDS__ = 2,      // English-Chinese mixed words (e.g. b超)
    __LEX_CEM_WORDS__ = 3,      // Chinese-English mixed words (e.g. 卡拉ok)
    __LEX_CN_LNAME__ = 4,       // Chinese last names
    __LEX_CN_SNAME__ = 5,       // Chinese single-character given names
    __LEX_CN_DNAME1__ = 6,      // first character of Chinese two-character given names
    __LEX_CN_DNAME2__ = 7,      // last character of Chinese two-character given names
    __LEX_CN_LNA__ = 8,         // Chinese last name adorn words
    __LEX_STOPWORDS__ = 9,      // stopword lexicon
    __LEX_ENPUN_WORDS__ = 10,   // English and punctuation mixed words (e.g. c++)
    __LEX_OTHER_WORDS__ = 15,   // unused
    __LEX_NCSYN_WORDS__ = 16    // unused
} friso_lex_t;
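Lexicon configuration file
The friso.lex.ini configuration file in the lexicon directory stores the lexicon categories and the names of the lexicon files under each category, in a one-to-many relationship. The default configuration is: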
# main lexicon
__LEX_CJK_WORDS__ :[
lex-main.lex;
lex-admin.lex;
lex-chars.lex;
lex-cn-mz.lex;
lex-cn-place.lex;
lex-company.lex;
lex-festival.lex;
lex-flname.lex;
lex-food.lex;
lex-lang.lex;
lex-nation.lex;
lex-net.lex;
lex-org.lex;
lex-touris.lex;
# add more here
]
# single chinese unit lexicon
__LEX_CJK_UNITS__ :[
lex-units.lex;
]
# english and chinese mixed word lexicon like "b超".
__LEX_ECM_WORDS__:[
lex-ecmixed.lex;
]
# chinese and english mixed word lexicon like "卡拉ok".
__LEX_CEM_WORDS__:[
lex-cemixed.lex;
]
# chinese last name lexicon.
__LEX_CN_LNAME__:[
lex-lname.lex;
]
# single name words lexicon.
__LEX_CN_SNAME__:[
lex-sname.lex;
]
# first word of a double chinese name.
__LEX_CN_DNAME1__:[
lex-dname-1.lex;
]
# second word of a double chinese name.
__LEX_CN_DNAME2__:[
lex-dname-2.lex;
]
# chinese last name decorate word.
__LEX_CN_LNA__:[
lex-ln-adorn.lex;
]
# stopwords lexicon
__LEX_STOPWORDS__:[
lex-stopword.lex;
]
# english and punctuation mixed words lexicon.
__LEX_ENPUN_WORDS__:[
lex-en-pun.lex;
]
# english words(for synonyms words)
__LEX_EN_WORDS__:[
lex-en.lex;
]
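The category and file-list layout of friso.lex.ini shown above is simple enough to scan line by line. The sketch below checks whether a lexicon file is listed under a given category; it is an illustration only, `lexicon_file_listed` and `trim_copy` are hypothetical names, and Friso's own loader is not shown in this document.

```c
#include <stdio.h>
#include <string.h>

/* Copies line[0..len) into buf without surrounding spaces, tabs, or CR. */
static void trim_copy(const char *line, size_t len, char *buf, size_t buf_size) {
    while (len > 0 && (line[0] == ' ' || line[0] == '\t')) { line++; len--; }
    while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t' ||
                       line[len - 1] == '\r')) len--;
    if (len >= buf_size) len = buf_size - 1;
    memcpy(buf, line, len);
    buf[len] = '\0';
}

/*
 * Returns 1 if `file` is listed under `category` in the configuration
 * text, 0 otherwise. Lines starting with '#' are comments, a line such as
 * "__LEX_CJK_WORDS__ :[" opens a category, and "]" closes it.
 */
int lexicon_file_listed(const char *text, const char *category, const char *file) {
    char line[256];
    char current[64] = "";
    while (*text != '\0') {
        const char *eol = strchr(text, '\n');
        size_t len = eol != NULL ? (size_t)(eol - text) : strlen(text);
        trim_copy(text, len, line, sizeof line);
        text += len + (eol != NULL ? 1 : 0);
        if (line[0] == '\0' || line[0] == '#') continue;
        if (line[0] == ']') { current[0] = '\0'; continue; }
        char *colon = strchr(line, ':');
        if (colon != NULL && strchr(colon, '[') != NULL) {
            /* Category header: remember the name before the colon. */
            trim_copy(line, (size_t)(colon - line), current, sizeof current);
            continue;
        }
        /* File entry: drop the trailing ';' before comparing. */
        size_t n = strlen(line);
        if (n > 0 && line[n - 1] == ';') line[n - 1] = '\0';
        if (strcmp(current, category) == 0 && strcmp(line, file) == 0) return 1;
    }
    return 0;
}
```

A check like this can confirm that a newly created lexicon file has actually been registered under the intended category before Friso is restarted.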
Adding a lexicon file
- Confirm the category: first decide which category the new lexicon file belongs to.
- Create the lexicon: for example, to add a lexicon file dedicated to plant names, create a lex-plants.lex file under dict/ and add entries to it, one entry per line.
- Enable the lexicon: the remaining, important step is to register the lexicon in friso.lex.ini. Most lexicons are CJK lexicons, so simply add lex-plants.lex as a line under the __LEX_CJK_WORDS__ category.
# main lexicon
__LEX_CJK_WORDS__ :[
lex-main.lex;
lex-admin.lex;
lex-chars.lex;
lex-cn-mz.lex;
lex-cn-place.lex;
lex-company.lex;
lex-festival.lex;
lex-flname.lex;
lex-food.lex;
lex-lang.lex;
lex-nation.lex;
lex-net.lex;
lex-org.lex;
lex-touris.lex;
# newly added plant name lexicon
lex-plants.lex;
# add more here
]
Adding entries to a lexicon
Find the relevant lexicon file, open it in a text editor, and add each new entry as one line in the format below (note: before adding, check that the same entry does not already exist).
Friso lexicon entry format:
entry/synonym set
Use null when there are no synonyms, and separate multiple synonyms with ASCII commas. For example:
你好/null
研究/琢磨,研讨,钻研
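The entry format above can be parsed with a few standard string calls. The following is a minimal sketch assuming exactly the `entry/synonym set` layout described here, with the trailing newline already removed; the `lex_entry` type, the `parse_lex_entry` function, and the size limits are hypothetical and not taken from Friso's source.

```c
#include <stdio.h>
#include <string.h>

#define MAX_SYNONYMS 8

/* One parsed lexicon line (hypothetical layout for illustration). */
typedef struct {
    char word[64];
    char synonyms[MAX_SYNONYMS][64];
    int synonym_count;
} lex_entry;

/*
 * Parses "word/syn1,syn2,..." into entry. The literal "null" after the
 * slash means the word has no synonyms.
 * Returns 0 on success, -1 if the line is malformed.
 */
int parse_lex_entry(const char *line, lex_entry *entry) {
    const char *slash = strchr(line, '/');
    const char *p;
    size_t word_len;

    entry->synonym_count = 0;
    if (slash == NULL) return -1;
    word_len = (size_t)(slash - line);
    if (word_len == 0 || word_len >= sizeof entry->word) return -1;
    memcpy(entry->word, line, word_len);
    entry->word[word_len] = '\0';

    p = slash + 1;
    if (strcmp(p, "null") == 0) return 0;

    /* Split the synonym set on ASCII commas, skipping empty items. */
    while (*p != '\0' && entry->synonym_count < MAX_SYNONYMS) {
        size_t n = strcspn(p, ",");
        if (n > 0 && n < sizeof entry->synonyms[0]) {
            memcpy(entry->synonyms[entry->synonym_count], p, n);
            entry->synonyms[entry->synonym_count][n] = '\0';
            entry->synonym_count++;
        }
        p += n;
        if (*p == ',') p++;
    }
    return 0;
}
```

Parsing `研究/琢磨,研讨,钻研` yields the word 研究 with three synonyms, and `你好/null` yields the word 你好 with none.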
Appendix
References
- MMSEG algorithm reference: http://technology.chtsai.org/mmseg/
Technical discussion and sharing
- For the older reference PDF, see friso-help-doc.pdf in the project
- Featured use case: RediSearch~information retrieval
- NLP discussion: WeChat: lionsoul2014 (please mention Friso), QQ: 1187582057 (rarely checked)