JamSpell
                                
                                
                                
Modern spell checking library - accurate, fast, multi-language
JamSpell is a spell checking library with the following features:
- accurate - it considers the surrounding words (context) for better correction
- fast - about 5K words per second
- multi-language - it is written in C++ and available for many languages through SWIG bindings
 
JamSpellPro
jamspell.com - check out the new JamSpell version with the following features:
- Improved accuracy (CatBoost gradient boosted decision trees model for ranking candidates)
- Splits merged words
- Pre-trained models (small, medium, large) for many languages: en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
- Ability to add words / sentences at runtime
- Fine-tuning / additional training
- Memory optimization for training large models
- Static dictionary support
- Built-in Java, C#, Ruby support
- Windows support
 
Content
- Benchmarks
- Usage
  - Python
  - C++
  - Other languages
  - HTTP API
- Train
 
Benchmarks
| | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words/second) |
| --- | --- | --- | --- | --- | --- | --- |
| JamSpell | 3.25% | 1.27% | 79.53% | 84.10% | 0.64% | 4854 |
| Norvig | 7.62% | 5.00% | 46.58% | 66.51% | 0.69% | 395 |
| Hunspell | 13.10% | 10.33% | 47.52% | 68.56% | 7.14% | 163 |
| Dummy | 13.14% | 13.14% | 0.00% | 0.00% | 0.00% | - |
The model was trained on 300K Wikipedia sentences plus 300K news sentences (English); 95% of the data was used for training and 5% for evaluation. An error model was used to generate misspelled text from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).
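The evaluation setup described above can be sketched roughly as follows. This is an illustration only, not the code used to produce the benchmark: the function names split_sentences, add_typo and add_typos, the default 10% per-word error probability, and the four edit types are all assumptions made for this example.

```python
import random
import string


def split_sentences(sentences, train_fraction=0.95, seed=0):
    """Shuffle sentences and split them into (train, evaluation) lists."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def add_typo(word, rng, alphabet=string.ascii_lowercase):
    """Apply one random edit to a word (hypothetical toy error model).

    The result can occasionally equal the input, e.g. when a letter
    is replaced by itself or two identical letters are swapped.
    """
    if len(word) < 2:
        return word
    pos = rng.randrange(len(word))
    kind = rng.choice(["replace", "delete", "insert", "swap"])
    if kind == "replace":
        return word[:pos] + rng.choice(alphabet) + word[pos + 1:]
    if kind == "delete":
        return word[:pos] + word[pos + 1:]
    if kind == "insert":
        return word[:pos] + rng.choice(alphabet) + word[pos:]
    # swap two adjacent letters; clamp so that pos + 1 stays in range
    pos = min(pos, len(word) - 2)
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]


def add_typos(words, error_prob=0.1, seed=0):
    """Return a copy of words where each word is corrupted with probability error_prob."""
    rng = random.Random(seed)
    return [add_typo(w, rng) if rng.random() < error_prob else w for w in words]
```

Fixing the seed keeps the split and the generated errors reproducible, which matters when several spell checkers are compared on the same corrupted text.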
We used the following metrics:
- Errors - percent of words with errors after the spell checker has processed the text
- Top 7 Errors - percent of words whose correct form is missing from the top 7 candidates
- Fix Rate - percent of misspelled words fixed by the spell checker
- Top 7 Fix Rate - percent of misspelled words fixed by one of the top 7 candidates
- Broken - percent of correctly spelled words broken by the spell checker
- Speed - number of words per second
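As a minimal sketch of how the Errors, Fix Rate and Broken metrics relate to each other, the function below computes them from three word lists. It assumes the lists are already aligned word by word (original text, text with generated errors, spell checker output); the function name evaluate and this input format are assumptions for illustration and do not describe how evaluate/evaluate.py works internally. The Top 7 metrics and Speed are left out.

```python
def evaluate(original, errored, corrected):
    """Compute Errors, Fix Rate and Broken (in percent) for aligned word lists."""
    if not (len(original) == len(errored) == len(corrected)):
        raise ValueError("word lists must be aligned and of equal length")

    errors_after = 0   # words still wrong after correction
    errored_total = 0  # words that were misspelled in the input
    fixed = 0          # misspelled words restored to the original
    clean_total = 0    # words that were correct in the input
    broken = 0         # correct words changed to something else

    for orig, err, corr in zip(original, errored, corrected):
        if corr != orig:
            errors_after += 1
        if err != orig:
            errored_total += 1
            if corr == orig:
                fixed += 1
        else:
            clean_total += 1
            if corr != orig:
                broken += 1

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "errors": pct(errors_after, len(original)),
        "fix_rate": pct(fixed, errored_total),
        "broken": pct(broken, clean_total),
    }


original = ["i", "am", "the", "best", "spell", "checker"]
errored = ["i", "am", "the", "begt", "spell", "cherken"]
corrected = ["i", "am", "the", "best", "spell", "chicken"]
result = evaluate(original, errored, corrected)
# One of the two misspelled words is fixed and no correct word is broken:
# fix_rate is 50.0, broken is 0.0, errors is 100 * 1 / 6
```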
 
To make sure the model is not overfitted to Wikipedia and news text, we also checked it on the text of "The Adventures of Sherlock Holmes":
| | Errors | Top 7 Errors | Fix Rate | Top 7 Fix Rate | Broken | Speed (words per second) |
| --- | --- | --- | --- | --- | --- | --- |
| JamSpell | 3.56% | 1.27% | 72.03% | 79.73% | 0.50% | 5524 |
| Norvig | 7.60% | 5.30% | 35.43% | 56.06% | 0.45% | 647 |
| Hunspell | 9.36% | 6.44% | 39.61% | 65.77% | 2.95% | 284 |
| Dummy | 11.16% | 11.16% | 0.00% | 0.00% | 0.00% | - |
More details on reproducing these results are available in the "Train" section.
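The columns of the benchmark tables are not independent. Going by the metric definitions, the words still wrong after correction are the misspelled words that were not fixed plus the correct words that were broken. The sketch below checks this against the first table, taking the Dummy row's Errors value as the share of misspelled words in the input. The function name expected_errors is an assumption for this example.

```python
def expected_errors(base_error_pct, fix_rate_pct, broken_pct):
    """Errors (percent) implied by the input error rate, Fix Rate and Broken."""
    unfixed = base_error_pct * (1 - fix_rate_pct / 100)
    newly_broken = (100 - base_error_pct) * broken_pct / 100
    return unfixed + newly_broken


BASE = 13.14  # Errors of the Dummy corrector in the first table
jamspell = round(expected_errors(BASE, 79.53, 0.64), 2)  # table: 3.25
norvig = round(expected_errors(BASE, 46.58, 0.69), 2)    # table: 7.62
hunspell = round(expected_errors(BASE, 47.52, 7.14), 2)  # table: 13.10
```

The same relation holds for the second table to within rounding of the published figures.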
Usage
Python
- Install swig3 (usually available in your distro's package manager)
- Install jamspell:
pip install jamspell
- Download or train a language model
- Use it:
 
import jamspell
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')
# Correct a whole fragment using the surrounding context
corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'
# Candidates for the word at position 3 ('begt')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )
# Candidates for the word at position 5 ('cherken')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
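In the example above, the second argument of GetCandidates is the zero-based position of the word in the list: 'begt' is at index 3 and 'cherken' at index 5. The helpers below are a small sketch built on that observation. The names load_corrector and candidates_for are hypothetical, and the check on the result of LoadLangModel assumes it returns a falsy value when loading fails.

```python
def load_corrector(model_path):
    """Create a corrector and load a language model (needs `pip install jamspell`)."""
    import jamspell  # imported lazily so candidates_for works without it

    corrector = jamspell.TSpellCorrector()
    # Assumption: LoadLangModel returns a falsy value when loading fails.
    if not corrector.LoadLangModel(model_path):
        raise RuntimeError("could not load language model: " + model_path)
    return corrector


def candidates_for(corrector, words, target):
    """Return candidates for the first occurrence of target in words.

    The position passed to GetCandidates is the index of the word
    in the list, as in the usage example above.
    """
    position = words.index(target)  # raises ValueError if target is absent
    return corrector.GetCandidates(words, position)
```

With a downloaded model, this could be called as candidates_for(load_corrector('en.bin'), ['i', 'am', 'the', 'begt', 'spell', 'cherken'], 'begt').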
C++
- Add the jamspell and contrib dirs to your project
- Use it:
 
#include <jamspell/spell_corrector.hpp>
int main(int argc, const char** argv) {
    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");
    // Correct a whole fragment using the surrounding context
    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"
    // Candidates for the word at position 3 ("begt")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...
    // Candidates for the word at position 5 ("cherken")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}
Other languages
You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.
HTTP API
- Install cmake
- Clone and build jamspell (it includes an HTTP server):
 
git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
- Download or train a language model
- Run the HTTP server:
 
./web_server/web_server en.bin localhost 8080
- GET Request example:
 
$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker
- POST Request example
 
$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
- Candidates example:
 
$ curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
$ curl -d "I am the begt spell cherken" http://localhost:8080/candidates
{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}
Here pos_from is the position of the first letter of the misspelled word in the text (counting from 0), and len is the length of the misspelled word.
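A client can use pos_from and len to map each result back onto the submitted text. The sketch below uses only the Python standard library. The function names are hypothetical, and two assumptions should be checked against a running server: that it accepts percent-encoded query text, and that offsets are counted in characters (the example response only confirms this for ASCII text).

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def candidates_url(text, host="localhost", port=8080):
    """Build a /candidates URL with the text percent-encoded."""
    return "http://{}:{}/candidates?text={}".format(host, port, quote(text))


def fetch_candidates(text, host="localhost", port=8080):
    """Query a running web_server and return the parsed JSON response."""
    with urlopen(candidates_url(text, host, port)) as reply:
        return json.loads(reply.read().decode("utf-8"))


def misspelled_words(text, response):
    """Extract the misspelled words that the results point at."""
    return [text[r["pos_from"]:r["pos_from"] + r["len"]]
            for r in response["results"]]


def apply_first_candidates(text, response):
    """Replace every misspelled word with its first candidate."""
    # Work from right to left so that earlier offsets stay valid
    # even when a replacement changes the length of the text.
    ordered = sorted(response["results"], key=lambda r: r["pos_from"], reverse=True)
    for r in ordered:
        if r["candidates"]:
            start, end = r["pos_from"], r["pos_from"] + r["len"]
            text = text[:start] + r["candidates"][0] + text[end:]
    return text
```

Applied to the response shown above, misspelled_words returns the two words "begt" and "cherken", and apply_first_candidates turns the request text into "I am the best spell checker".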
Train
To train a custom model:
- Install cmake
- Clone and build jamspell:
 
git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
- Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)
- Train the model:
 
./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
- To evaluate the spell checker, you can use the evaluate/evaluate.py script:
python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
- You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.
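When training several models, the two commands above can be driven from a script. The following is a minimal sketch that only rebuilds the command lines shown in this section. The function names and default paths are assumptions, the value given to -mx is passed through unchanged, and the working directory has to match wherever the jamspell binary and the evaluate script live in your checkout.

```python
import subprocess


def train_command(alphabet, corpus, model_out, binary="./main/jamspell"):
    """Build the argument list for: jamspell train <alphabet> <corpus> <model>."""
    return [binary, "train", alphabet, corpus, model_out]


def evaluate_command(alphabet, model, test_data, mx=50000,
                     script="evaluate/evaluate.py"):
    """Build the argument list for the evaluate.py call shown above."""
    return ["python", script, "-a", alphabet, "-jsp", model,
            "-mx", str(mx), test_data]


def run(command):
    """Run a command and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)
```

For example, run(train_command("../test_data/alphabet_en.txt", "../test_data/sherlockholmes.txt", "model_sherlock.bin")) would start the same training run as the command in the Train model step, when executed from the build directory.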
Download models
Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.