
Installation issue on old macOSes for new Korean tokenizer in v4.0 alpha

Open • BLKSerene opened this issue 2 years ago • 1 comment

Hi, I noticed from #12328 that spaCy has switched to pymecab-ko for the Korean tokenizer in the upcoming spaCy 4.0, but there seem to be some installation/import issues with this package on older macOS versions (cf. pymecab-ko/#5).

I've tested on OS X 10.11 and found that python-mecab-ko, the alternative mentioned in #12328, can be compiled, installed, and imported successfully. Would it be possible to add it as another option for the Korean tokenizer in spaCy 4.0?

Your Environment

  • Operating System: Windows 11 x64, OS X 10.11
  • Python Version Used: 3.9.16
  • spaCy Version Used: 4.0 alpha

BLKSerene, Mar 14 '23 08:03

Thanks for the note! It does look like the package python-mecab-ko has had a better set of published wheels since their updates in December. We will evaluate it and consider switching to python-mecab-ko for spaCy v4 or at least adding it as an alternative.

You can always write a short custom tokenizer if you need one; the code would look similar to this:

https://github.com/explosion/spaCy/blob/520279ff7c9af199928e2a727999162cb79c38a3/spacy/lang/ko/__init__.py#L25-L75
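
For readers who want a starting point, here is a minimal sketch of such a custom tokenizer. It is not the code from the linked file. `spaces_after` and `MorphTokenizer` are hypothetical names, and the morpheme splitter is passed in as a plain callable so that any analyzer can be plugged in. The sketch assumes that the analyzer returns surface forms which appear verbatim and in order in the input, and that words are separated by single spaces.

```python
from typing import Callable, List


def spaces_after(text: str, surfaces: List[str]) -> List[bool]:
    """For each surface form, report whether whitespace follows it in `text`."""
    flags = []
    pos = 0
    for surface in surfaces:
        # Search from the end of the previous token to keep the order.
        start = text.find(surface, pos)
        if start < 0:
            raise ValueError(f"surface {surface!r} not found in text")
        pos = start + len(surface)
        # An empty slice at the end of the text gives False.
        flags.append(text[pos:pos + 1].isspace())
    return flags


class MorphTokenizer:
    """Hypothetical spaCy tokenizer wrapping any morpheme splitter."""

    def __init__(self, vocab, morphs: Callable[[str], List[str]]):
        self.vocab = vocab
        self.morphs = morphs  # assumption: returns a list of surface strings

    def __call__(self, text: str):
        # Imported lazily so the alignment helper works without spaCy.
        from spacy.tokens import Doc

        surfaces = list(self.morphs(text))
        return Doc(self.vocab, words=surfaces, spaces=spaces_after(text, surfaces))
```

To try it, assign an instance to `nlp.tokenizer` of an existing pipeline, for example `nlp.tokenizer = MorphTokenizer(nlp.vocab, morphs)`, where `morphs` could be the `morphs` method of a python-mecab-ko `MeCab` object (an assumption about that package's API, not something confirmed in this thread).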

adrianeboyd, Mar 14 '23 15:03