wordfreq
No `mecab-ipadic-utf8` on centos 7, how can I use wordfreq on Japanese in this case?
Hi, I'm trying to use wordfreq on Japanese on CentOS 7. I keep getting the error "Couldn't find the MeCab dictionary named 'mecab-ipadic-utf8'", but there's no such package on CentOS 7; it's called mecab-ipadic. How can I run wordfreq on CentOS 7 in this case? Thank you so much.
To be able to use wordfreq in Japanese, you need to have a UTF-8 compatible version of MeCab installed. If your package manager doesn't provide one (I checked and confirmed that the CentOS version is the incompatible EUC-JP version), you need to install it from source:
git clone https://github.com/taku910/mecab
cd mecab/mecab
./configure --enable-utf8-only
make
sudo make install
That should install MeCab with a UTF-8 Japanese dictionary in /usr/local/lib/mecab/dic, where wordfreq will find it.
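As a rough illustration of what "where wordfreq will find it" means, here is a sketch of a dictionary lookup that scans known install locations for known dictionary names. The function name, the name list, and the search paths are my assumptions for illustration, not wordfreq's actual code:

```python
import os

def find_mecab_dictionary(names, search_paths):
    """Return the first existing dictionary directory matching one of
    `names` under one of `search_paths`, or None if none is found."""
    for path in search_paths:
        for name in names:
            candidate = os.path.join(path, name)
            if os.path.isdir(candidate):
                return candidate
    return None

# Example: look for a Japanese dictionary in the usual install locations.
JAPANESE_NAMES = ['mecab-ipadic-utf8', 'ipadic-utf8']
SEARCH_PATHS = ['/usr/lib/mecab/dic', '/usr/local/lib/mecab/dic']
print(find_mecab_dictionary(JAPANESE_NAMES, SEARCH_PATHS))
```

The `make install` step above puts the dictionary under one of those search paths, which is why it gets picked up without any configuration.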
Oh, there's more that you need to actually get the dictionary:
cd ../mecab-ipadic
./configure --enable-utf8-only
make
sudo make install
I haven't confirmed that this part works, unfortunately, and I can't read Japanese well enough to follow the documentation.
Thanks for your comment. Referring to https://centos.pkgs.org/7/nux-dextop-x86_64/mecab-ipadic-2.7.0.20070801-10.el7.nux.1.x86_64.rpm.html, I thought mecab-ipadic on CentOS 7 is for UTF-8 use only. So I wonder, for wordfreq/mecab.py, should mecab-ipadic be added to the list on line 58, 'ja': ['mecab-ipadic-utf8', 'ipadic-utf8']? Because mecab-python3 now works with the current configuration, I guess it's not UTF-8 related (maybe I'm wrong).
Oh, I see! On CentOS, unlike on Ubuntu, the unmarked version is the UTF-8 one. I saw the reference to "EUCJP", but that's a separate version of the package, marked because it's the EUC-JP version. I do appreciate it defaulting to UTF-8.
So it sounds like that fix would work for CentOS. The problem, then, is that the dictionary of the same name on Ubuntu would be the wrong one. When you run MeCab with the wrong encoding of dictionary, it runs, but outputs complete nonsense results. Can you think of any way to select the correct dictionary for the operating system?
Thank you for confirming my fix for CentOS. To be honest, I don't know. I think it would be better to add OS info to MECAB_DICTIONARY_NAMES for dictionary selection, if possible. Also, I'm curious how mecab-python3 works, as it also requires mecab-ipadic-utf8.
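To make the "add OS info" idea concrete, here is one way an OS-aware name list could be sketched, reading the distro ID from /etc/os-release. The function and the distro-to-name mapping are my guesses based on this thread (CentOS ships the UTF-8 dictionary as plain mecab-ipadic, Debian/Ubuntu mark it with -utf8), not tested on either system:

```python
def dictionary_names_for_distro(os_release_text):
    """Pick Japanese dictionary names based on /etc/os-release content.
    Assumption: CentOS/Fedora/RHEL ship the UTF-8 dictionary as plain
    'mecab-ipadic', while Debian/Ubuntu mark it with '-utf8'."""
    fields = {}
    for line in os_release_text.splitlines():
        if '=' in line:
            key, _, value = line.partition('=')
            fields[key] = value.strip('"')
    distro = fields.get('ID', '')
    if distro in ('centos', 'fedora', 'rhel'):
        return ['mecab-ipadic', 'ipadic']
    return ['mecab-ipadic-utf8', 'ipadic-utf8']

# Usage (on a real system):
#     with open('/etc/os-release') as f:
#         names = dictionary_names_for_distro(f.read())
```

This keeps the Ubuntu behavior unchanged while letting CentOS find its differently-named package.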
Hi, adding 'mecab-ipadic' doesn't work, but changing line 53 to return MeCab.Tagger("-Ochasen") works.
I'm sorry, I don't understand what that's doing.
@rspeer I just use the suggested dictionary setting of Mecab, referring to this repo https://github.com/SamuraiT/mecab-python3
Oh I get it: if we leave out the dictionary, MeCab will use whatever Japanese dictionary it prefers.
Unfortunately, on Ubuntu, the default Japanese dictionary is mecab-jumandic-utf8, not mecab-ipadic-utf8. So if jumandic is installed, this would appear to work but give results that aren't compatible with wordfreq's vocabulary.
This is hard. I would like to make this work on multiple flavors of Linux and make sure that it never silently gives wrong results, and I don't know how to test that part. Part of the problem is that I don't have a way to test what happens on CentOS.
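One cheap guard against the silent-nonsense failure mode might be a round-trip check: with a correctly-encoded dictionary, the surface forms MeCab emits should concatenate back to the input, while mojibake from a wrong-encoding dictionary generally won't. This is a heuristic sketch, not part of wordfreq, and the helper name is mine:

```python
def tokens_round_trip(text, tokens):
    """Heuristic sanity check for tokenizer output: surface forms from a
    correctly-encoded dictionary should reassemble the input (ignoring
    spaces). Garbled output from a wrong-encoding dictionary fails this."""
    return ''.join(tokens) == text.replace(' ', '')

# A correct tokenization of 日本語 passes; garbled output does not.
assert tokens_round_trip('日本語', ['日本', '語'])
assert not tokens_round_trip('日本語', ['鏡', '�'])
```

Running a check like this once at tagger setup, on a known Japanese string, could turn "silently wrong results" into a loud error, without needing to know which distro we're on.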
I really appreciate your contribution to wordfreq. To test on CentOS, you could probably use a Docker image of CentOS.
Here is a workaround for Fedora users like me:
$ sudo dnf install mecab-ipadic
$ sudo mkdir /usr/lib/mecab/dic/ -p
$ sudo ln -s /usr/lib64/mecab/dic/ipadic /usr/lib/mecab/dic/mecab-ipadic-utf8
Hello, I'm the maintainer of mecab-python3.
Starting at the end of 2018, mecab-python3 began distributing wheels for Linux that included a bundled IPADic. So at that point, installing a system MeCab binary and dictionary was unnecessary, and would in fact have been ignored. (That covers all posts in this thread, for the record.)
Starting with v1 last summer, wheels are available for Windows, OSX, and Linux, but the dictionary is bundled as an extra package. You seem to have already addressed this by depending on the ipadic package on PyPI, but your setup.py says a system mecab-dev package install is required, which is not the case.
If you have any issues with MeCab please feel free to @ me at any time.
As a side note, I notice you packaged mecab-ko-dic for use with PyPI. I would recommend getting someone who speaks Korean to check the output of that - the main thing I understand about that dictionary is that it's intended for use with a patched MeCab binary that has special treatment of spaces. However, I don't speak Korean and have never seen an English explanation of this, so I haven't been able to figure out the details.
Hello polm, thanks for the info here.
The release of version 2.4.2 is, as you've noticed, intended to resolve the long-standing difficulty of dealing with mecab's dependencies. This bug can be closed, except I'll leave it open for a bit because we're having this conversation here.
I didn't know that the default behavior of mecab-python3 itself had changed to use ipadic (we were overriding the default dictionary path before, so that we could look for multiple dictionaries). And it's true that I left a comment behind in setup.py that no longer applies.
I'm going to start looking into the effects of the Korean dictionary. If there's a problem with handling spaces, it may not appear in typical use of wordfreq -- where the inputs usually come from another dictionary or tokenizer, and wouldn't include internal spaces -- but it would affect exquisite-corpus, the build pipeline that generates the word frequencies in wordfreq.
So if there's a problem, it would show up as unusual word frequencies in the data. Again, I'll look into it.