SymSpell LookupCompound excluding Numbers and Special characters
I'm trying to use SymSpell for spell correction in OCR post-processing. I have noticed that SymSpell LookupCompound excludes numbers and special characters from its output. In my context, numbers and special characters are really important for further analysis. Is it possible to avoid dropping them?
Version: SymSpell 6.3 C# project
Steps to reproduce:
- Build the SymSpell C# code.
- Go to \SymSpell\SymSpell.CompoundDemo
- Run dotnet run.
- Enter the input below: "To find out more about how we use information, visit or contact-any of our offices 24/7"
- It gives the output below: to find out more about how we use information visit or contact any of our offices of 5 30,646,750
Problem: The output contains neither the ',' nor the 24/7 from the input.
Expected behavior: to find out more about how we use information, visit or contact any of our offices 24/7
Punctuation: When the input string is parsed into separate terms in line 773 (string[] termList1 = ParseWords(input);), the punctuation characters (like ',') between words are currently discarded. They could instead be preserved and stored in a separate array, possibly together with the upper/lower-case information of the words. After the correction, the result is created from the separate suggestionParts in line 894. At this point the suggestionParts could be recombined with the preserved punctuation characters and case information.
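The preserve-and-recombine idea for punctuation and casing can be sketched in Python, outside of SymSpell itself. This is only a minimal sketch under stated assumptions, not the library's implementation: split_keep_separators, transfer_case, correct_preserving, and the correct_word callback are hypothetical names, and the correction is applied word by word, so it does not reproduce LookupCompound's ability to split or merge terms.

```python
import re
from typing import Callable, List, Tuple

# Assumption: a "word" is a run of ASCII letters or apostrophes.
# Everything else (spaces, punctuation, digits) is a separator kept verbatim.
WORD_RE = re.compile(r"[A-Za-z']+")


def split_keep_separators(text: str) -> List[Tuple[str, bool]]:
    """Return (chunk, is_word) pairs that concatenate back to `text`."""
    parts: List[Tuple[str, bool]] = []
    pos = 0
    for match in WORD_RE.finditer(text):
        if match.start() > pos:
            parts.append((text[pos:match.start()], False))
        parts.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        parts.append((text[pos:], False))
    return parts


def transfer_case(source: str, target: str) -> str:
    """Copy a simple casing pattern (ALL CAPS or Capitalized) onto target."""
    if source.isupper() and len(source) > 1:
        return target.upper()
    if source[:1].isupper():
        return target[:1].upper() + target[1:]
    return target


def correct_preserving(text: str, correct_word: Callable[[str], str]) -> str:
    """Correct only the word chunks; re-insert separators and casing."""
    out = []
    for chunk, is_word in split_keep_separators(text):
        if is_word:
            out.append(transfer_case(chunk, correct_word(chunk.lower())))
        else:
            out.append(chunk)
    return "".join(out)


# Hypothetical stand-in for a real spell-correction call.
fixes = {"infromation": "information"}
print(correct_preserving(
    "To find out more, visit our Infromation desk 24/7",
    lambda word: fixes.get(word, word)))
# prints: To find out more, visit our Information desk 24/7
```

Because digits fall into the separator chunks here, tokens such as 24/7 also pass through unchanged.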
Numbers: Currently 24/7 is treated as two separate terms, 24 and 7 (/ is treated as punctuation and discarded). As there are no numbers in the included dictionary, the two terms "24" and "7" are "corrected" into "of" and "a". Either you add numbers to the dictionary, or you remove and preserve all numbers during the parsing in line 773 and re-combine them later.
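The second option for numbers (set them aside during parsing and re-combine later) might look roughly like the following Python sketch. It is an illustration under assumptions, not SymSpell code: correct_keeping_numbers and the correct_phrase callback are hypothetical names, tokens are split on whitespace, and any token containing a digit is treated as protected. Runs of ordinary words between protected tokens are passed to the corrector as whole phrases, so a compound-aware corrector could still split or merge words inside each run.

```python
import re
from typing import Callable, List

HAS_DIGIT = re.compile(r"\d")


def correct_keeping_numbers(text: str,
                            correct_phrase: Callable[[str], str]) -> str:
    """Correct runs of digit-free tokens; pass digit tokens through as-is."""
    out: List[str] = []
    run: List[str] = []  # consecutive tokens that contain no digit

    def flush() -> None:
        # Send the pending run to the corrector as one phrase.
        if run:
            out.append(correct_phrase(" ".join(run)))
            run.clear()

    for token in text.split():
        if HAS_DIGIT.search(token):
            flush()
            out.append(token)  # e.g. "24/7" is kept verbatim
        else:
            run.append(token)
    flush()
    # Note: whitespace is normalized to single spaces.
    return " ".join(out)


# Hypothetical stand-in for a compound correction call.
def fake_corrector(phrase: str) -> str:
    return phrase.replace("offces", "offices")


print(correct_keeping_numbers(
    "contact any of our offces 24/7 or call 555-0100 today",
    fake_corrector))
# prints: contact any of our offices 24/7 or call 555-0100 today
```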
I will add this feature later this year.
Hi @wolfgarbe, is this function implemented?
Not yet.
Hi @wolfgarbe is this function implemented?
I think it is vitally important not to remove numbers.
Hi @wolfgarbe
Are these functionalities implemented?
Is this feature available yet?
Try this
from absl import app
from absl import flags
from symspellpy import SymSpell, Verbosity
import pkg_resources
import re

flags.DEFINE_string(
    "test_sentence",
    "If the extracted string less less than 50 characters long, and is not sentence-terminated, then we assume that it is a header.",
    "sample test sentence")
flags.DEFINE_string(
    "test_sentence1",
    "itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness",
    "sample test sentence")
flags.DEFINE_string("filename", "sample1.txt", "filename")

FLAGS = flags.FLAGS


class SpellChecker(object):
    def __init__(self, edit_distance_max=2, prefix_length=7):
        self.dictionary_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt")
        self.sym_spell = SymSpell(edit_distance_max, prefix_length)
        self.sym_spell.load_dictionary(self.dictionary_path, 0, 1)
        self.edit_distance_max = edit_distance_max

    def do_symspell(self, sentence):
        # Strip a trailing period so it can be re-attached after correction.
        endswith_dot = False
        if sentence.endswith('.'):
            sentence = sentence[:-1]
            endswith_dot = True
        # Register every token that contains a non-letter (digits,
        # punctuation) as a dictionary word, with the intent that
        # lookup_compound leaves it unchanged.
        # Note: _words is a private attribute of SymSpell.
        for word in sentence.split():
            if re.search("[^a-zA-Z]", word):
                if word not in self.sym_spell._words:
                    self.sym_spell._words[word] = 1
                else:
                    self.sym_spell._words[word] += 1
        results = self.sym_spell.lookup_compound(
            sentence,
            max_edit_distance=self.edit_distance_max,
            transfer_casing=True,
            ignore_non_words=True,
            split_phrase_by_space=True,
            ignore_term_with_digits=True)
        sentence = sentence if not results else results[0].term
        return sentence + "." if endswith_dot else sentence

    def do_word_segmentation(self, sentence):
        results = self.sym_spell.word_segmentation(sentence)
        return results.corrected_string


def main(_):
    spell_checker_obj = SpellChecker()
    # with open(FLAGS.filename) as f:
    #     sentences = f.read().splitlines()
    # for sentence in sentences:
    #     print("Prev: %s" % sentence)
    #     print("After: %s" % spell_checker_obj.do_symspell(sentence))
    print(spell_checker_obj.do_symspell(
        "“I see it, I deduce it. How do I know that you have been getting yourself very wet lately, and that you have a most clumsy and careless servent girl?”"))
    # print(spell_checker_obj.do_word_segmentation(FLAGS.test_sentence1))


if __name__ == '__main__':
    app.run(main)
On Tue, May 19, 2020 at 7:06 AM Soumya wrote:
Hello. Is there an option in symspell where you can skip a list of keywords, rather than defining a regex for the purpose? Also, is there a way to detect language using symspell?