The spell checker could point out bad words that appear exactly once.
The larger the document, the more likely it is that a word appearing exactly once is a typo.
The attached patch attempts to point out such singletons that are not recognized by the spell checker.
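In a nutshell, the heuristic looks like this. A minimal sketch outside the patch, where the word list and the known_word dictionary predicate are placeholders for what pdfcompare and hunspell actually provide:

    from collections import Counter

    def singleton_candidates(words, known_word):
        # Count case-insensitively; a word that occurs exactly once
        # and also fails the dictionary check is a likely typo.
        counts = Counter(w.lower() for w in words)
        return sorted(w for w, n in counts.items()
                      if n == 1 and not known_word(w))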
Issues with the patch:
- it shows false positives, e.g. "Ceph" appears multiple times in the Cloud 3 deployment guide, but still ends up in the singleton list (a counting-based alternative is sketched after the patch).
- it includes many word fragments, as we currently do not handle hyphenation well (a rough dehyphenation sketch follows this list).
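A rough idea for the hyphenation problem: rejoin fragments that end in a hyphen at a line break before handing words to the spell checker. This sketch assumes we see the extracted text line by line and naively treats every trailing hyphen as a line-break split:

    def dehyphenate(lines):
        # Rejoin words that a line break split with a trailing hyphen.
        out = []
        carry = ''
        for line in lines:
            words = line.split()
            if not words:
                continue
            if carry:
                words[0] = carry + words[0]   # complete the fragment from the previous line
                carry = ''
            if words[-1].endswith('-'):
                carry = words[-1][:-1]        # keep the fragment; it continues on the next line
                words.pop()
            out.extend(words)
        if carry:
            out.append(carry)
        return out

This would incorrectly glue real hyphenated compounds that happen to break at a hyphen, so it trades one kind of false positive for another.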
[ouch, is there no way to attach a file here?]
--- /suse/jw/src/github/pdfcompare/pdf_highlight.py	2014-01-07 15:28:01.000000000 +0100
+++ /usr/bin/pdfcompare	2014-01-28 19:54:19.604902143 +0100
@@ -88,6 +88,8 @@
 #                         later on. Strange.
 # 2014-01-07, V1.6.5 jw - manually merged https://github.com/jnweiger/pdfcompare/pull/4
 #                         hope, I did not break too much...
+# 2014-01-28, V1.6.6 jw - --spell now prints out a word list of non-dictionary words seen
+#                         exactly once.
 #
 # osc in devel:languages:python python-pypdf >= 1.13+20130112
 #  need fix from https://bugs.launchpad.net/pypdf/+bug/242756
@@ -1119,7 +1121,8 @@
     each page, giving the exact coordinates of the bounding box of all occurances.
     Font metrics are used to interpolate into the line fragments found in the dom tree.

-    Keys and values from ext['e'] are merged into the DecoratedWord output for pattern matches and spell check findings.
+    Keys and values from ext['e'] are merged into the DecoratedWord output for pattern
+    matches and spell check findings.
     If re_pattern is None, then wordlist is used instead. Keys and values from
     ext['a'], ext['d'], or ext['c'] respectively are merged into the DecoratedWord
     output for added, deleted, or changed texts (respectivly).
@@ -1218,7 +1221,7 @@
   print("SequenceMatcher done")         # this means nothing... s.get_opcodes() takes ages!

 def opcodes_find_moved(iter_list):
-  """ adds a 6th element to the yielded tuple, which holds refernces between
+  """ adds a 6th element to the yielded tuple, which holds references between
     'delete's and 'insert's. Example hint = { 'ref': [ (start_idx, similarity), ...] }
     A similarity of 1.0 means "identical"; a similarity of 0.0 has nothing in common.
     The elements in the ref list are sorted by decreasing similarity.
@@ -1392,19 +1395,33 @@
   if spell_check:
     h = Hunspell(dicts=None)
     word_set = set()
+    singularity_word_set = set()
     for word in wl_new:
       m = re.search('([a-z_-]{3,})', word[0], re.I)
       if m:
         # preserve capitalization. hunspell handles that nicely.
         stem = m.group(1)
         word_set.add(stem)
-        if not 's' in word[3]: word[3]['s'] = {}
+        if not 's' in word[3]:
+          word[3]['s'] = {}
+          singularity_word_set.add(stem)
+        elif stem in singularity_word_set:
+          singularity_word_set.remove(stem)
         word[3]['s'][word[2]] = stem
     print("%d words to check" % len(word_set))
     bad_word_dict = h.check_words(word_set)
     print("checked: %d bad" % len(bad_word_dict))
     if debug > 1: pprint(['bad_word_dict: ', bad_word_dict])
+
+    bad_singularity_word_set = set()
+    for word in bad_word_dict:
+      if word in singularity_word_set:
+        bad_singularity_word_set.add(word)
+    # not really useful. Too many false positives due to not recognized hyphenations.
+    if debug:
+      pprint([len(bad_singularity_word_set), 'bad_singularities: ', bad_singularity_word_set])
     idx = 0
     for word in wl_new:
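One way to attack the false positives would be to count stem occurrences explicitly instead of toggling singularity_word_set membership per word instance. An untested sketch against the same variables the patch uses (wl_new, h.check_words):

    from collections import Counter
    import re

    stem_counts = Counter()
    for word in wl_new:
        m = re.search('([a-z_-]{3,})', word[0], re.I)
        if m:
            stem_counts[m.group(1)] += 1

    bad_word_dict = h.check_words(set(stem_counts))
    # a stem counts as a singleton only if it occurred exactly once in the whole document
    bad_singularity_word_set = set(w for w in bad_word_dict if stem_counts[w] == 1)

This would at least guarantee that a word like Ceph, spelled the same way throughout, can never end up in the singleton list; the hyphenation fragments are a separate problem.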