SymSpell LookupCompound excluding Numbers and Special characters

I'm trying to use SymSpell for OCR post-processing spell correction. I have noticed that SymSpell's LookupCompound excludes numbers and special characters from the output. In my context, numbers and special characters are really important for further analysis. Is it possible to keep LookupCompound from eliminating them?

Version: SymSpell 6.3 C# project

Steps to reproduce:

1. Build the SymSpell C# code.
2. Go to \SymSpell\SymSpell.CompoundDemo
3. Run dotnet run .
4. Enter the following input: "To find out more about how we use information, visit or contact-any of our offices 24/7"
5. It gives the following output: to find out more about how we use information visit or contact any of our offices of 5 30,646,750

Problem: The output contains neither the ',' nor the 24/7.

Expected behavior: to find out more about how we use information, visit or contact any of our offices 24/7

Jun 26 '18 16:06 geomygeorge

Punctuation: When the input string is parsed into separate terms in line 773, string[] termList1 = ParseWords(input);, the punctuation characters (like ',') between words are currently discarded. They could instead be preserved and stored in a separate array, possibly together with the upper/lower-case information of the words. After the correction, the result is created from the separate suggestionParts in line 894. At this point the suggestionParts could be recombined with the preserved punctuation characters and case information.
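The preserve-and-recombine idea above can be sketched in Python. This is only an illustration, not SymSpell code: `correct_preserving_punctuation`, `restore_case`, and the `correct_word` callback are hypothetical names, and the sketch assumes a word-by-word corrector, so it does not cover the case where LookupCompound splits or merges words.

```python
import re
from typing import Callable

# Runs of letters (optionally with inner apostrophes) are treated as words;
# everything between them (spaces, ',', '-', digits, '/') is kept verbatim.
WORD_RE = re.compile(r"([A-Za-z]+(?:'[A-Za-z]+)*)")


def restore_case(original: str, corrected: str) -> str:
    """Copy the coarse casing pattern of `original` onto `corrected`."""
    if len(original) > 1 and original.isupper():
        return corrected.upper()
    if original[0].isupper():
        return corrected[:1].upper() + corrected[1:]
    return corrected


def correct_preserving_punctuation(
    text: str, correct_word: Callable[[str], str]
) -> str:
    """Correct each word, then recombine with the original separators."""
    # With one capture group, re.split puts the separators at even indices
    # and the matched words at odd indices.
    parts = WORD_RE.split(text)
    for i in range(1, len(parts), 2):
        original = parts[i]
        parts[i] = restore_case(original, correct_word(original.lower()))
    return "".join(parts)


if __name__ == "__main__":
    # Toy stand-in for a real spell corrector (assumption for the demo).
    fixes = {"contct": "contact"}
    print(correct_preserving_punctuation(
        "To find out more, visit or contct-any of our offices 24/7",
        lambda word: fixes.get(word, word),
    ))
```

Because the separators are never passed to the corrector, the comma, the hyphen, and 24/7 come back unchanged.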

Numbers: Currently 24/7 is treated as two separate terms, 24 and 7 (/ is treated as punctuation and discarded). As there are no numbers in the included dictionary, the two terms "24" and "7" are "corrected" into "of" and "a". Either you add numbers to the dictionary, or you remove and preserve all numbers during the parsing in line 773 and re-combine them later.
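The second option (set numbers aside during parsing and re-combine them later) might look roughly like the Python sketch below. The function name `correct_protecting_numbers` and the `correct_phrase` callback are hypothetical; the callback stands in for whatever compound corrector is used. Note that the sketch splits on whitespace, so runs of spaces are collapsed to a single space.

```python
import re
from typing import Callable, List

HAS_DIGIT = re.compile(r"\d")


def correct_protecting_numbers(
    text: str, correct_phrase: Callable[[str], str]
) -> str:
    """Correct only the digit-free stretches; pass digit tokens through."""
    output: List[str] = []
    pending: List[str] = []  # consecutive tokens that contain no digit

    def flush() -> None:
        # Send the collected digit-free tokens to the corrector as one phrase.
        if pending:
            output.append(correct_phrase(" ".join(pending)))
            pending.clear()

    for token in text.split():
        if HAS_DIGIT.search(token):
            flush()
            output.append(token)  # e.g. "24/7" is kept exactly as typed
        else:
            pending.append(token)
    flush()
    return " ".join(output)


if __name__ == "__main__":
    # Toy stand-in for a compound corrector (assumption for the demo).
    toy = lambda phrase: phrase.replace("ofices", "offices")
    print(correct_protecting_numbers(
        "contact any of our ofices 24/7 or call 555-0100 today", toy
    ))
```

With symspellpy, `correct_phrase` could wrap `lookup_compound` and return the `term` of the first suggestion. The trade-off of this design is that the corrector never sees context across a number, since each digit-free stretch is corrected on its own.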

I will add this feature later this year.

Jun 26 '18 18:06 wolfgarbe

Hi @wolfgarbe, is this function implemented?

I will add this feature later this year.

Oct 25 '18 04:10 trungkiendang

Not yet.

Oct 25 '18 07:10 wolfgarbe

Hi @wolfgarbe is this function implemented?

Feb 13 '19 06:02 Prashant118

I think it is vitally important not to remove numbers.

Feb 13 '19 16:02 fahadshery

Hi @wolfgarbe

Are these functionalities implemented?

Jul 23 '19 06:07 hardiksanchawat

Is this feature available yet?

May 02 '20 08:05 islama-lh

Try this

from absl import app
from absl import flags
from symspellpy import SymSpell, Verbosity
import pkg_resources
import re
import pdb

flags.DEFINE_string("test_sentence", "If the extracted string less less than 50 characters long, and is not sentence-terminated, then we assume that it is a header.", "sample test sentence")
flags.DEFINE_string("test_sentence1", "itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness", "sample test sentence")
flags.DEFINE_string("filename", 'sample1.txt', "filename")

FLAGS = flags.FLAGS


class SpellChecker(object):
    def __init__(self, edit_distance_max=2, prefix_length=7):
        self.dictionary_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt")
        self.sym_spell = SymSpell(edit_distance_max, prefix_length)
        self.sym_spell.load_dictionary(self.dictionary_path, 0, 1)
        self.edit_distance_max = edit_distance_max

    def do_symspell(self, sentence):
        # Strip a trailing period and restore it after correction.
        endswith_dot = False
        if sentence.endswith('.'):
            sentence = sentence[:-1]
            endswith_dot = True
        # Register every token containing a non-letter (digits, punctuation)
        # as a dictionary word so that it is not "corrected" away.
        # Note: _words is a private attribute of SymSpell.
        for word in sentence.split():
            if re.search("[^a-zA-Z]", word):
                if word not in self.sym_spell._words:
                    self.sym_spell._words[word] = 1
                else:
                    self.sym_spell._words[word] += 1

        results = self.sym_spell.lookup_compound(
            sentence,
            max_edit_distance=self.edit_distance_max,
            transfer_casing=True,
            ignore_non_words=True,
            split_phrase_by_space=True,
            ignore_term_with_digits=True)
        sentence = sentence if not results else results[0].term
        return sentence + "." if endswith_dot else sentence

    def do_word_segmentation(self, sentence):
        results = self.sym_spell.word_segmentation(sentence)
        return results.corrected_string


def main(_):
    spell_checker_obj = SpellChecker()
    # with open(FLAGS.filename) as f:
    #     sentences = f.read().splitlines()
    # for sentence in sentences:
    #     print("Prev: %s" % sentence)
    #     print("After: %s" % spell_checker_obj.do_symspell(sentence))

    print(spell_checker_obj.do_symspell(
        "“I see it, I deduce it. How do I know that you have been getting yourself very wet lately, and that you have a most clumsy and careless servent girl?”"))
    # print(spell_checker_obj.do_word_segmentation(FLAGS.test_sentence1))


if __name__ == '__main__':
    app.run(main)

On Tue, May 19, 2020 at 7:06 AM Soumya wrote:

Hello. Is there an option in symspell where you can skip a list of keywords, rather than defining a regex for the purpose? Also, is there a way to detect language using symspell?


May 19 '20 13:05 islama-lh