spaCy
Installation issue on older macOS versions for the new Korean tokenizer in v4.0 alpha
Hi, I noticed from #12328 that spaCy has switched to pymecab-ko for the Korean tokenizer in the upcoming spaCy 4.0, but there seem to be some installation/import issues with this package on macOS (cf. pymecab-ko/#5).
I've confirmed on OS X 10.11 that python-mecab-ko, the alternative mentioned in #12328, can be compiled, installed, and imported successfully. Would it be possible to add it as another option for the Korean tokenizer in spaCy 4.0?
Your Environment
- Operating System: Windows 11 x64, OS X 10.11
- Python Version Used: 3.9.16
- spaCy Version Used: 4.0 alpha
Thanks for the note! It does look like python-mecab-ko has had a better set of published wheels since its updates in December. We will evaluate it and consider switching to python-mecab-ko for spaCy v4, or at least adding it as an alternative.
You can always write a short custom tokenizer if you need one; the code would look similar to this:
https://github.com/explosion/spaCy/blob/520279ff7c9af199928e2a727999162cb79c38a3/spacy/lang/ko/__init__.py#L25-L75
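For anyone who wants to try this before a decision is made, here is a rough sketch of what such a custom tokenizer might look like. It is not the code from the linked file. The names `align_spaces` and `MorphemeTokenizer` are made up for illustration, and `segment` is assumed to be any callable that returns the surface forms of the morphemes in order (for example the `morphs` method of python-mecab-ko's `MeCab` class, assuming it behaves that way in your version). The sketch only records whether a token is followed by whitespace, so runs of several spaces are collapsed to one.

```python
from typing import Callable, List, Tuple


def align_spaces(text: str, surfaces: List[str]) -> Tuple[List[str], List[bool]]:
    """Map morpheme surfaces back onto `text`.

    Returns (words, spaces), where spaces[i] is True if words[i] is
    followed by a gap (assumed to be whitespace) in the original text.
    """
    words: List[str] = []
    spaces: List[bool] = []
    pos = 0
    for surface in surfaces:
        start = text.find(surface, pos)
        if start < 0:
            raise ValueError(f"{surface!r} not found in text after offset {pos}")
        if words and start > pos:
            # Something sits between the previous token and this one;
            # this sketch treats any such gap as a single space.
            spaces[-1] = True
        words.append(surface)
        spaces.append(False)
        pos = start + len(surface)
    return words, spaces


class MorphemeTokenizer:
    """Hypothetical spaCy tokenizer that delegates segmentation to `segment`."""

    def __init__(self, vocab, segment: Callable[[str], List[str]]):
        self.vocab = vocab
        self.segment = segment

    def __call__(self, text: str):
        # Imported here so that align_spaces can be used without spaCy installed.
        from spacy.tokens import Doc

        words, spaces = align_spaces(text, self.segment(text))
        return Doc(self.vocab, words=words, spaces=spaces)
```

To use it, you would create a pipeline that does not load the built-in Korean tokenizer (for instance `spacy.blank("ko", config={"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}})`), then assign `nlp.tokenizer = MorphemeTokenizer(nlp.vocab, MeCab().morphs)`. Unlike the built-in tokenizer, this version does not set tags or lemmas on the tokens.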