MorphMan

Hunspell dictionaries

Open chickendude opened this issue 5 years ago • 3 comments

For many "space" languages there are decent Hunspell dictionaries available. I used PyHunSpell to return each word's stem. I've only tested it with Basque, but I've been using it for a little while now (just pushed my changes to a fork) and I see far fewer duplicate (inflected) words than before.

I'm not sure how you'd set it up on Windows, since it requires installing Hunspell and PyHunSpell first, so I don't know how viable it is for most folks. In my opinion, though, it's a nice improvement for languages that rely more heavily on inflection. Not sure if this is worth looking into so that other folks can use it as well.
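For anyone curious what the stemming step looks like, here is a minimal sketch rather than the code from the fork. It assumes PyHunSpell's `hunspell.HunSpell(dic_path, aff_path)` constructor and its `stem()` method, which returns a list of candidate stems (as bytes on Python 3), and it assumes the dictionary is UTF-8 encoded. `load_hunspell`, `base_form`, `unique_base_forms`, and `DictStemmer` are hypothetical names used only for illustration, and the Basque word pairs are illustrative examples.

```python
from typing import Dict, Iterable, List, Optional


def load_hunspell(dic_path: str, aff_path: str):
    """Return a HunSpell object, or None if PyHunSpell is not installed."""
    try:
        import hunspell  # provided by the pyhunspell package
    except ImportError:
        return None
    return hunspell.HunSpell(dic_path, aff_path)


def base_form(word: str, stemmer: Optional[object]) -> str:
    """Return the first stem the stemmer suggests, or the word itself."""
    if stemmer is None:
        return word
    stems = stemmer.stem(word)
    if not stems:
        return word  # unknown word: keep it unchanged
    first = stems[0]
    # Assumption: stems arrive as UTF-8 bytes; plain str is passed through.
    if isinstance(first, bytes):
        return first.decode("utf-8", errors="replace")
    return first


def unique_base_forms(words: Iterable[str], stemmer: Optional[object]) -> List[str]:
    """Collapse inflected duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for word in words:
        stem = base_form(word, stemmer)
        if stem not in seen:
            seen.add(stem)
            result.append(stem)
    return result


class DictStemmer:
    """Hypothetical stand-in with the same stem() shape, for trying this
    out on a machine without Hunspell installed."""

    def __init__(self, table: Dict[str, str]):
        self.table = table

    def stem(self, word: str) -> List[bytes]:
        if word in self.table:
            return [self.table[word].encode("utf-8")]
        return []
```

With a real dictionary you would pass the result of `load_hunspell(...)` as the stemmer. Taking the first suggested stem is a simplification, since Hunspell can return several candidates for an ambiguous word.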

chickendude avatar Sep 29 '19 19:09 chickendude

I'd be interested in looking into this once we have a nice test suite (because then I can ensure that adding this feature doesn't negatively impact the parsing). However, as explained in other issues, I simply do not know how to add a test suite currently...

shanrauf avatar Dec 25 '19 13:12 shanrauf

@chickendude I looked into Hunspell and realized it has nothing to do with test cases, since it's essentially a set of totally new parsers, so ignore what I said above (although I got a test suite working anyway). It looks like you need to install dictionaries to use PyHunSpell, which is fine, but we don't want to bundle all of those dictionaries with MorphMan. What we should do is keep the space morphemizer and add a morphemizer for every language that has a Hunspell dictionary. Either users drag and drop the dictionaries into their add-on folder, or we download the dictionary for them through Anki when they select that morphemizer/parser. Including all of the dictionaries up front would make the add-on far too large (and I already made it larger in #60 by directly adding MeCab as a dependency because it made testing easier). What do you think?
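Roughly, the selection logic could look like the sketch below. This is not MorphMan's actual morphemizer API. It assumes dictionaries are stored as `<lang>.dic` and `<lang>.aff` in one folder, and `find_dictionary` and `pick_morphemizer` are hypothetical names. The download step is left out on purpose.

```python
from pathlib import Path
from typing import Optional, Tuple


def find_dictionary(dict_dir: Path, lang: str) -> Optional[Tuple[Path, Path]]:
    """Return the (.dic, .aff) pair for a language if both files exist."""
    dic = dict_dir / f"{lang}.dic"
    aff = dict_dir / f"{lang}.aff"
    if dic.is_file() and aff.is_file():
        return dic, aff
    return None


def pick_morphemizer(dict_dir: Path, lang: str) -> str:
    """Use the Hunspell morphemizer only when its dictionary is present;
    otherwise fall back to the existing space morphemizer."""
    if find_dictionary(dict_dir, lang) is not None:
        return "hunspell"
    return "space"
```

Falling back to the space morphemizer means nothing breaks for users who never install a dictionary.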

shanrauf avatar Dec 27 '19 14:12 shanrauf

Yeah, that sounds like a good alternative, though to use PyHunSpell you also need to install Hunspell, as PyHunSpell is basically just a wrapper around it. It's relatively straightforward to build on Linux (and I would assume on Mac as well). I'm not sure how it'd be on Windows, as I've never tried setting it up there. But being able to download/import dictionaries would be great. I agree that including the dictionaries directly wouldn't make sense, as it just makes the add-on unnecessarily large.

chickendude avatar Jan 01 '20 17:01 chickendude