pyspellchecker modifies tokens that are already correct, and fails to properly correct some misspelled ones

Open • ljpetkovic opened this issue 4 years ago • 1 comment

Hello, I am using pyspellchecker (French dictionary) in order to correct this file.

However, after running the script below, I have noticed two types of problems (output file):

- certain tokens that were initially correct were modified incorrectly (e.g. collection is changed to colletion);
- some tokens that were initially incorrect were not corrected properly (e.g. MÉMOÎRES should be corrected to MÉMOIRES).

The script:

import re, glob
from spellchecker import SpellChecker

entry = "6228000_r.txt"
output = "6228000_r_corr.txt"

spell = SpellChecker(language='fr')
text = open(entry).read()

# do not correct tokens containing apostrophes (e.g. l’empire, d’art, s’étend...)
r1 = re.findall(r"([lL]’\w+|[dD]’\w+|[sS]’\w+|[qQ]u’\w+|[cC]’\w+|[nN]’\w+|[jJ]’\w+|[Ll]orfqu’\w+|eft)",text)

# tokenise the text with the pyspellchecker tokeniser
tokens = spell.split_words(text)

spell.word_frequency.load_words(r1)
spell.known(r1)  # the words l’empire, d’art, s’étend etc. are now in the dictionary of known words

print(tokens)
misspelled = spell.unknown(tokens)

with open(output, "w") as f:
    for m in misspelled:
        corrected = spell.correction(m)
        text = text.replace(m, corrected)
        # f.write(c.replace('clafliques', 'classiques'))
    f.write(text)

I cleaned up the original .txt file by replacing the single quote (') with the apostrophe (’) in words such as l’empire. I also tried removing some other special characters (e.g. ^, &, <, >), but the errors persist, and I cannot pinpoint what causes them.

Do you have any idea how to resolve this issue?

Jun 30 '21 15:06 ljpetkovic

This is likely an issue with the French dictionary itself, stemming from the data source used to build it. I am not a French speaker, so I cannot really validate the data in the dictionary. You can see the script used to build the dictionary here, and there is a discussion of how it is done and how it could be improved in this discussion.

Any help with updating the dictionary would be appreciated.

Jun 30 '21 18:06 barrust
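
Separately from the dictionary question, the replacement loop in the script may contribute to both symptoms, so it could be worth ruling out. `text.replace(m, corrected)` substitutes every occurrence of `m` as a substring, so a short misspelled token can be rewritten inside a longer, correct word. Also, as far as I can tell, `spell.unknown()` returns lowercased words by default, so an uppercase token such as MÉMOÎRES would never match in the original text. Below is a minimal sketch of whole-word, case-preserving replacement using only the standard library. The names `replace_whole_words` and `match_case`, and the `ct` → `et` correction, are hypothetical and chosen for illustration; they are not part of pyspellchecker, and this is not a confirmed diagnosis of the output above.

```python
import re

# \w is Unicode-aware for str patterns in Python 3, so accented letters
# such as É or Î are treated as word characters.
WORD = re.compile(r"\w+")


def match_case(original, replacement):
    """Give `replacement` the same casing pattern as `original`."""
    if original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement[:1].upper() + replacement[1:]
    return replacement


def replace_whole_words(text, corrections):
    """Apply `corrections` (lowercase misspelling -> lowercase fix)
    to whole words only, leaving every other token untouched."""
    def fix(match):
        word = match.group(0)
        corrected = corrections.get(word.lower())
        if corrected is None:
            return word  # not a known misspelling: keep as-is
        return match_case(word, corrected)

    return WORD.sub(fix, text)


# Hypothetical corrections, for illustration only.
corrections = {"ct": "et", "mémoîres": "mémoires"}
sample = "La collection ct les MÉMOÎRES"

# Naive substring replacement damages "collection" and misses the
# uppercase word:
naive = sample
for wrong, right in corrections.items():
    naive = naive.replace(wrong, right)
# naive == "La colleetion et les MÉMOÎRES"

fixed = replace_whole_words(sample, corrections)
# fixed == "La collection et les MÉMOIRES"
```

With pyspellchecker, the `corrections` mapping could presumably be built from `spell.unknown(tokens)` and `spell.correction(word)`, skipping any word for which no correction is returned. If tokens are still corrected wrongly after that, the dictionary data is the more likely culprit.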