
spell checker could point out bad words that appear exactly once.

Open · jnweiger opened this issue 11 years ago • 1 comment

The larger the document, the more likely it is that a word appearing exactly once in it is a typo.

The attached patch attempts to point out such singletons that are not recognized by the spell checker.

Issues with the patch:

  • it shows false positives; e.g. Ceph appears multiple times in the Cloud 3 deployment guide, but still ends up in the singleton list.
  • it includes many word fragments, as we currently do not handle hyphenation well.

[ouch, is there no way to attach a file here?]
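
In essence, the patch tracks for each stem whether it has been seen exactly once, and then reports the stems that both fail the dictionary check and are singletons. Below is a minimal standalone sketch of that idea, not the patch itself: the regex, the is_known_word callable and the sample text are illustrative stand-ins for pdfcompare's Hunspell wrapper and its DecoratedWord lists.

import re
from collections import Counter

def singleton_suspects(text, is_known_word):
    # Words of 3+ letters, capitalization preserved (the patch notes that
    # hunspell handles capitalization nicely).
    stems = re.findall(r'[A-Za-z_-]{3,}', text)
    counts = Counter(stems)
    # Report only stems that occur exactly once AND are unknown to the dictionary.
    return sorted(w for w, n in counts.items() if n == 1 and not is_known_word(w))

# Toy run: "Cehp" is a typo seen once; "Ceph" appears twice and stays out of the list.
known = {"stores", "objects", "fast", "scales"}
print(singleton_suspects("Ceph stores objects. Cehp is fast. Ceph scales.",
                         is_known_word=lambda w: w in known))
# -> ['Cehp']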

jnweiger · Jan 28 '14 19:01

--- /suse/jw/src/github/pdfcompare/pdf_highlight.py 2014-01-07 15:28:01.000000000 +0100
+++ /usr/bin/pdfcompare 2014-01-28 19:54:19.604902143 +0100
@@ -88,6 +88,8 @@
 #                        later on. Strange.
 # 2014-01-07, V1.6.5 jw - manually merged https://github.com/jnweiger/pdfcompare/pull/4
 #                        hope, I did not break too much...
+# 2014-01-28, V1.6.6 jw - --spell now prints out a word list of non-dictionary words seen
+#                        exaclty once.
 #
 # osc in devel:languages:python python-pypdf >= 1.13+20130112
 #  need fix from https://bugs.launchpad.net/pypdf/+bug/242756
@@ -1119,7 +1121,8 @@
     each page, giving the exact coordinates of the bounding box of all occurances.
     Font metrics are used to interpolate into the line fragments found in the dom tree.
-    Keys and values from ext['e'] are merged into the DecoratedWord output for pattern matches and spell check findings.
+    Keys and values from ext['e'] are merged into the DecoratedWord output for pattern
+    matches and spell check findings.
     If re_pattern is None, then wordlist is used instead.
     Keys and values from ext['a'], ext['d'], or ext['c'] respectively are merged into
     the DecoratedWord output for added, deleted, or changed texts (respectivly).
@@ -1218,7 +1221,7 @@
     print("SequenceMatcher done")   # this means nothing... s.get_opcodes() takes ages!

     def opcodes_find_moved(iter_list):
-      """ adds a 6th element to the yielded tuple, which holds refernces between
+      """ adds a 6th element to the yielded tuple, which holds references between
          'delete's and 'insert's. Example hint = { 'ref': [ (start_idx, similarity), ...] }
          A similarity of 1.0 means "identical"; a similarity of 0.0 has nothing in common.
          The elements in the ref list are sorted by decreasing similarity.
@@ -1392,19 +1395,33 @@
   if spell_check:
     h = Hunspell(dicts=None)
     word_set = set()
+    singularity_word_set = set()
     for word in wl_new:
       m = re.search('([a-z_-]{3,})', word[0], re.I)
       if m:
         # preserve capitalization. hunspell handles that nicely.
         stem = m.group(1)
         word_set.add(stem)
-        if not 's' in word[3]: word[3]['s'] = {}
+        if not 's' in word[3]:
+          word[3]['s'] = {}
+          singularity_word_set.add(stem)
+        elif stem in singularity_word_set:
+          singularity_word_set.remove(stem)
         word[3]['s'][word[2]] = stem

     print("%d words to check" % len(word_set))
     bad_word_dict = h.check_words(word_set)
     print("checked: %d bad" % len(bad_word_dict))
     if debug > 1: pprint(['bad_word_dict: ', bad_word_dict])
+
+    bad_singularity_word_set = set()
+    for word in bad_word_dict:
+      if word in singularity_word_set:
+        bad_singularity_word_set.add(word)
+    # not really useful. Too many false positives due to not recognized hyphenations.
+    if debug:
+      pprint([len(bad_singularity_word_set), 'bad_singularities: ', bad_singularity_word_set])

     idx = 0
     for word in wl_new:
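
The hyphenation false positives mentioned in the patch come from word fragments at line breaks, which look like rare words and inflate the singleton list. A rough sketch of one possible mitigation, not part of the patch: the dehyphenate helper below is hypothetical and operates on raw text lines, whereas a real fix would have to happen in pdfcompare's own text extraction from the dom tree.

import re

def dehyphenate(lines):
    # Join words that were split across a line break, e.g. "frag-" + "ments" -> "fragments".
    text = "\n".join(lines)
    return re.sub(r'(\w)-\n\s*(\w)', r'\1\2', text)

print(dehyphenate(["Font metrics are used to interpolate into the line frag-",
                   "ments found in the dom tree."]))
# -> Font metrics are used to interpolate into the line fragments found in the dom tree.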

jnweiger · Jan 28 '14 19:01