mecab-docs-en
Translation of the MeCab documentation to English
MeCab English Documentation
This is based on the MeCab documentation.
What is MeCab?
MeCab is an open source morphological analysis engine developed as a joint research unit project between the Graduate School of Informatics, Kyoto University and the Communication Science Laboratories of Nippon Telegraph and Telephone Corporation (NTT). It is designed for general-purpose use and does not depend on a particular language, dictionary, or corpus.
MeCab estimates its parameters with Conditional Random Fields (CRF), which improves accuracy over the Hidden Markov Model used by ChaSen. On average, MeCab also runs faster than ChaSen, Juman, and KAKASI. Incidentally, mekabu is the author's favourite dish.
(Translator's note: MeCab in Japanese is pronounced 'mekabu' - which is the thick part of wakame seaweed just above the root)
Features
- Generic design that does not depend on a particular language, dictionary, or corpus
- High-accuracy analysis based on Conditional Random Fields (CRF)
- Faster than ChaSen and KAKASI
- Dictionary lookup uses a double-array, a fast trie data structure
- Library is re-entrant
- Bindings for various scripting languages (Perl, Ruby, Python, Java, C#)
Comparison
(TODO)
Mailing List
Changelog
(TODO)
Downloads
See the project page for up-to-date downloads.
Installation
Unix
Requirements
- C++ compiler (confirmed to build with g++ 3.4.3 and VC7)
- iconv (libiconv): used for dictionary format conversion
MeCab Installation
Installation follows the usual procedure for free software.
% tar zxfv mecab-X.X.tar.gz
% cd mecab-X.X
% ./configure
% make
% make check
% su
# make install
Dictionary installation
% tar zxfv mecab-ipadic-2.7.0-XXXX.tar.gz
% cd mecab-ipadic-2.7.0-XXXX
% ./configure
% make
% su
# make install
Windows
On Windows, install the binary distribution with the self-extracting installer. The dictionary is installed the same way.
Usage
Getting Started
When started without arguments, MeCab reads text from standard input and analyzes it. Each input line is treated as one sentence.
% mecab
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
The output format differs significantly from ChaSen's. Each line has the form:
表層形\t品詞,品詞細分類1,品詞細分類2,品詞細分類3,活用型,活用形,原形,読み,発音
or in English:
Surface form\t
Part of speech,
Part of speech subcategory 1,
Part of speech subcategory 2,
Part of speech subcategory 3,
Conjugation type,
Conjugation form,
Base form,
Reading,
Pronunciation
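As a minimal sketch of how this format can be consumed, the Python snippet below splits one output line into its surface form and feature list. The function name `parse_line` and the dictionary keys are illustrative and not part of MeCab. The field positions assume the ipadic layout shown above, and the plain comma split assumes that no feature value itself contains a comma.

```python
def parse_line(line):
    """Split one line of MeCab's default output into named parts.

    Returns None for the "EOS" sentence terminator.
    The key names are illustrative; MeCab does not define them.
    """
    line = line.rstrip("\n")
    if line == "EOS":
        return None
    # The surface form and the feature string are separated by one tab.
    surface, _, feature_str = line.partition("\t")
    # Naive split: assumes no feature value contains a comma.
    features = feature_str.split(",")

    def field(i):
        # Unknown words may carry fewer fields, and "*" marks an empty one.
        if i < len(features) and features[i] != "*":
            return features[i]
        return None

    return {
        "surface": surface,
        "pos": field(0),
        "base_form": field(6),
        "reading": field(7),
        "pronunciation": field(8),
        "features": features,
    }


token = parse_line("すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ")
print(token["surface"], token["pos"], token["reading"])
```

Returning `None` for missing fields keeps callers from indexing past the end of a short feature list, which can happen for words that are not in the dictionary.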
If a file name is given as an argument, MeCab analyzes the contents of that file. Output can be written to a file with the -o option.
% mecab INPUT -o OUTPUT
Word Segmentation (Wakati-gaki)
Use -O wakati to print each sentence with its words separated by spaces:
% mecab -O wakati
太郎はこの本を二郎を見た女性に渡した。
太郎 は この 本 を 二郎 を 見 た 女性 に 渡し た 。
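The space-separated output is easy to post-process. The following Python sketch shows one way to do it: `split_wakati` turns an output line into a word list, and `run_wakati` is a hypothetical helper that calls the `mecab` command through `subprocess`. It assumes that a `mecab` executable is on the PATH and that the installed dictionary uses UTF-8; both are assumptions about the local setup.

```python
import subprocess


def split_wakati(line):
    """Turn one line of `-O wakati` output into a list of words."""
    # Words are separated by single spaces; drop the empty string left
    # by any trailing space at the end of the line.
    return [word for word in line.rstrip("\n").split(" ") if word]


def run_wakati(text, encoding="utf-8"):
    """Run `mecab -O wakati` on text and return one word list per line.

    Assumes `mecab` is on PATH and that the dictionary was built in the
    given encoding; adjust `encoding` to match the installation.
    """
    result = subprocess.run(
        ["mecab", "-O", "wakati"],
        input=text,
        capture_output=True,
        encoding=encoding,
        check=True,
    )
    return [split_wakati(line) for line in result.stdout.splitlines()]


words = split_wakati("太郎 は この 本 を 二郎 を 見 た 女性 に 渡し た 。 \n")
print(len(words), words[:4])
```

Only `split_wakati` runs when the snippet is executed; `run_wakati` is defined but not called, so the example works without MeCab installed.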
Change output format
Select another output format with the -O option:
% mecab -Oyomi (Assign readings)
% mecab -Ochasen (ChaSen compatible)
% mecab -Odump (Full information dump)
These output formats are defined in /usr/local/lib/mecab/ipadic/dicrc. Users can also define their own formats; see the Output Formats documentation (TODO).
Advanced Usage
Changing the character code
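The character encoding of the dictionary is chosen when the dictionary is built, through the --with-charset option of its configure script. To build a Shift-JIS dictionary:
% tar zxfv mecab-ipadic-2.7.0-xxxx.tar.gz
% cd mecab-ipadic-2.7.0-xxxx
% ./configure --with-charset=sjis
% make
To build a UTF-8 dictionary:
% tar zxfv mecab-ipadic-2.7.0-xxxx.tar.gz
% cd mecab-ipadic-2.7.0-xxxx
% ./configure --with-charset=utf8
% make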
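A dictionary can also be rebuilt in another encoding with mecab-dict-index, where -f gives the encoding of the dictionary source files and -t the encoding of the compiled dictionary. For example, to convert from EUC-JP to UTF-8:
% cd mecab-ipadic-2.7.0-xxxx
% /usr/local/libexec/mecab/mecab-dict-index -f euc-jp -t utf-8
% su
# make install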
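When MeCab is driven from a script, the text exchanged with it generally has to use the same encoding as the dictionary. The Python sketch below illustrates that idea under this assumption. The `CODECS` table and both helper names are hypothetical; they map the charset names used in the commands above to Python's standard codec names.

```python
# Charset names as used in the commands above, mapped to Python codecs.
# This table is illustrative and not part of MeCab.
CODECS = {
    "euc-jp": "euc_jp",
    "sjis": "shift_jis",
    "utf8": "utf_8",
    "utf-8": "utf_8",
}


def encode_for_dictionary(text, charset):
    """Encode text into the byte encoding of the dictionary."""
    return text.encode(CODECS[charset])


def decode_from_dictionary(data, charset):
    """Decode bytes produced in the byte encoding of the dictionary."""
    return data.decode(CODECS[charset])


for charset in ("euc-jp", "sjis", "utf8"):
    raw = encode_for_dictionary("すもも", charset)
    # The byte length differs by encoding, but the round trip is lossless.
    print(charset, len(raw), decode_from_dictionary(raw, charset))
```

The same three characters occupy a different number of bytes in each encoding, which is why bytes written for one dictionary encoding cannot be read with another.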
UTF-8 only mode
(TODO)
Unknown Word Estimation
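MeCab estimates the part of speech of words that are not in the dictionary. In the two analyses below, the same unknown word is tagged as a place name or as a person's name depending on the suffix that follows it.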
ホリエモン市
ホリエモン	名詞,固有名詞,地域,一般,*,*,*
市	名詞,接尾,地域,*,*,*,市,シ,シ
EOS
ホリエモンさん
ホリエモン	名詞,固有名詞,人名,一般,*,*,*
さん	名詞,接尾,人名,*,*,*,さん,サン,サン
The feature string printed for unknown words can be replaced with the --unk-feature option:
% mecab --unk-feature "未知語"
ホリエモンさん
ホリエモン	未知語
さん	名詞,接尾,人名,*,*,*,さん,サン,サン
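A fixed marker makes unknown words easy to collect afterwards. The Python sketch below assumes output produced with `--unk-feature "未知語"` and returns every surface form whose feature string is exactly that marker. The function name and the default marker value are illustrative choices.

```python
def find_unknown_words(output, unk_feature="未知語"):
    """Return the surface forms whose feature string equals unk_feature."""
    unknown = []
    for line in output.splitlines():
        # Skip the sentence terminator and anything that is not a token line.
        if line == "EOS" or "\t" not in line:
            continue
        surface, feature = line.split("\t", 1)
        if feature == unk_feature:
            unknown.append(surface)
    return unknown


sample = (
    "ホリエモン\t未知語\n"
    "さん\t名詞,接尾,人名,*,*,*,さん,サン,サン\n"
    "EOS\n"
)
print(find_unknown_words(sample))
```

Comparing the whole feature string, rather than searching for the marker inside it, avoids false matches on dictionary entries that happen to contain the same characters.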
N-Best Solution Output
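With the -N option, MeCab outputs the N best analyses, best first. For example, -N2 prints the two best analyses of each sentence: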
% mecab -N2
今日もしないとね。
今日	名詞,副詞可能,*,*,*,*,今日,キョウ,キョー
も	助詞,係助詞,*,*,*,*,も,モ,モ
し	動詞,自立,*,*,サ変・スル,未然形,する,シ,シ
ない	助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ
と	助詞,接続助詞,*,*,*,*,と,ト,ト
ね	助詞,終助詞,*,*,*,*,ね,ネ,ネ
。	記号,句点,*,*,*,*,。,。,。
EOS
今日	名詞,副詞可能,*,*,*,*,今日,キョウ,キョー
もし	副詞,一般,*,*,*,*,もし,モシ,モシ
ない	形容詞,自立,*,*,形容詞・アウオ段,基本形,ない,ナイ,ナイ
と	助詞,接続助詞,*,*,*,*,と,ト,ト
ね	助詞,終助詞,*,*,*,*,ね,ネ,ネ
。	記号,句点,*,*,*,*,。,。,。
EOS
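Each candidate analysis ends with its own EOS line, so N-best output can be split on those lines. The Python sketch below does this for a single input sentence. The function name `split_analyses` is illustrative, and the sample string is shortened to the first three features of each token to keep it readable.

```python
def split_analyses(output):
    """Split MeCab output into analyses, each a list of (surface, features)."""
    analyses, current = [], []
    for line in output.splitlines():
        if line == "EOS":
            # An EOS line closes the current candidate analysis.
            analyses.append(current)
            current = []
        elif line:
            surface, _, feature_str = line.partition("\t")
            current.append((surface, feature_str.split(",")))
    return analyses


# Two candidate analyses of 今日もしないとね。 with shortened features.
sample = (
    "今日\t名詞,副詞可能,*\nも\t助詞,係助詞,*\nし\t動詞,自立,*\n"
    "ない\t助動詞,*,*\nと\t助詞,接続助詞,*\nね\t助詞,終助詞,*\n"
    "。\t記号,句点,*\nEOS\n"
    "今日\t名詞,副詞可能,*\nもし\t副詞,一般,*\nない\t形容詞,自立,*\n"
    "と\t助詞,接続助詞,*\nね\t助詞,終助詞,*\n。\t記号,句点,*\nEOS\n"
)

for rank, analysis in enumerate(split_analyses(sample), start=1):
    print(rank, [surface for surface, _ in analysis])
```

If the input contains several sentences, the candidates for all of them arrive in one stream, so the caller would need to group the returned lists N at a time.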
Acknowledgements
Thanks to Jorge Nocedal for making the FORTRAN implementation of L-BFGS publicly available.
http://www.ece.northwestern.edu/~nocedal/lbfgs.html
J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773-782.
D.C. Liu and J. Nocedal. On the Limited Memory BFGS Method for Large Scale Optimization (1989), Mathematical Programming B, 45, 3, pp. 503-528.
Mekabu (めかぶ)