hunspell port to Ruby

Open zverok opened this issue 6 years ago • 2 comments

Project

Port the open-source Hunspell spellchecker to pure Ruby.

Proposed code name: spelleology.

Plan

  1. Understand the Hunspell dictionary format.
  2. Create a Hunspell dictionary reader, using Hunspell's code, docs, and sample dictionaries as a reference.
  3. Create a simplistic spell-checking solution (split text into words → remove punctuation → run against dictionary).
  4. Wrap it into a proper Ruby gem, with executable and library usage (ver. ~0.0.1).
  5. Further development directions:
    • profiling and optimization
    • CI-readiness (different output formats, Rake task)
    • supplementary tools (dictionary downloader from OO repository)
    • pluggable integration with Markdown parsers and other markup formats, for proper reporting of the positions of spelling problems in marked-up files.
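
Step 3 of the plan can be sketched roughly as follows. This is only a minimal illustration, not the proposed gem's API: the method name `misspelled_words` is hypothetical, the dictionary is assumed to be a plain `Set` of lowercase words, and affix rules from a real Hunspell dictionary are ignored entirely.

```ruby
require "set"

# Hypothetical helper (not part of any existing gem): splits text into words,
# strips leading and trailing punctuation from each token, and returns the
# words that are not found in the dictionary.
# Assumption: `dictionary` is a Set of lowercase words with no affix handling.
def misspelled_words(text, dictionary)
  text.split(/\s+/)
      .map { |token| token.gsub(/\A[^\p{L}\p{N}]+|[^\p{L}\p{N}]+\z/, "") }
      .reject(&:empty?)
      .reject { |word| dictionary.include?(word.downcase) }
end

dictionary = Set.new(%w[hello world this is a test])
p misspelled_words("Hello, wrold! This is a test.", dictionary)
# prints ["wrold"]
```

Punctuation inside a token (apostrophes, hyphens) is left untouched here; a real checker would need a language-aware policy for that.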

Importance

Hunspell is currently the most popular open-source spellchecking tool, and most up-to-date dictionaries are available in its format. But the tool itself is a fairly complicated piece of C++ software that is hard to integrate and use from Ruby.

A pure-Ruby Hunspell port can be easily integrated with other Ruby tools, like Markdown parsers (or even a Ruby parser: imagine being able to spellcheck your Rake task descriptions), Jekyll, CI tools, and so on.

Skills and domains

You'll need to be able to at least read the C++ in Hunspell's sources. Also expect a lot of optimization practice.

Feb 23 '18 16:02 zverok

Hey there! First of all, thanks for creating lmsa - it really helps to get something interesting to do!

Speaking of the issue itself: there's the https://github.com/segabor/Hunspell gem, which is a Ruby wrapper on top of the native library. What are the advantages of having a pure-Ruby implementation? I only see disadvantages: it would be slower, and outdated by definition (it needs ongoing maintenance to catch up with the original library).

Feb 26 '18 11:02 nattfodd

@nattfodd The first thing is that the most important part of Hunspell is its dictionaries. With pure-Ruby access to them, a lot could be done with their morphological information, beyond just the is_it_a_correct_word?(word) and suggest_spell_check(word) that most existing Hunspell wrappers provide.
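
To illustrate what "access to the dictionaries" could mean, here is a rough sketch of reading `.dic` entries. It assumes the basic layout described in Hunspell's format documentation: a first line with the approximate entry count, then one `word/FLAGS` entry per line, optionally followed by morphological fields such as `po:verb`. The names `DicEntry` and `parse_dic` are hypothetical, and the sketch assumes single-character flags and ignores the `.aff` file, escaped slashes, and words containing spaces.

```ruby
# Hypothetical value object for one dictionary entry (illustration only).
DicEntry = Struct.new(:word, :flags, :morph)

# Parses the text of a .dic file into DicEntry objects.
# Assumption: flags are single characters, which is the default when the
# .aff file has no FLAG directive.
def parse_dic(source)
  lines = source.lines.map(&:strip).reject(&:empty?)
  # The first line of a .dic file holds the approximate number of entries.
  lines.shift if lines.first&.match?(/\A\d+\z/)
  lines.map do |line|
    head, *fields = line.split(/\s+/)
    word, flags = head.split("/", 2)
    # Morphological fields look like "po:verb"; keep them as [key, value] pairs.
    morph = fields.map { |f| f.split(":", 2) }.select { |pair| pair.size == 2 }
    DicEntry.new(word, (flags || "").chars, morph)
  end
end

sample = <<~DIC
  3
  hello
  try/B
  work/AB po:verb
DIC

parse_dic(sample).each { |entry| p entry.to_a }
```

The flags only become meaningful once they are resolved against the affix rules in the matching `.aff` file, which is where most of the porting work would be.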

Another thing is that it is always nice to have pure-Ruby equivalents of important libraries (if it is not an incredible amount of work, and it is not), for when dependencies (libhunspell) can't be properly installed or are outdated.

And finally, the task is in fact pretty simple, and having a nice, clean "reference implementation" in a high-level language could be beneficial not only for the Ruby community. BTW, Hunspell itself is not very stable or well-documented software, and it is currently undergoing a huge redesign, so having it as the "single point of truth" for mainstream OSS spell checking is sub-optimal, let's say.

PS: one funny link

Feb 26 '18 11:02 zverok