
spell checker could point out bad words that appear exactly once.

Open · jnweiger opened this issue 11 years ago • 1 comment

The larger the document, the more likely it is that a word appearing exactly once in it is a typo.

The attached patch attempts to point out such singletons that are not recognized by the spell checker.

Issues with the patch:

  • it shows false positives; e.g. Ceph appears multiple times in the Cloud 3 deployment guide, but still ends up in the singleton list.
  • it includes many word fragments, as we currently do not handle hyphenation well.

[ouch, is there no way to attach a file here?]
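
In essence, the patch tracks for each stem whether it has been seen exactly once, and then reports the stems that both fail the dictionary check and are singletons. Below is a minimal standalone sketch of that idea, not the patch itself: the regex, the is_known_word callable and the sample text are illustrative stand-ins for pdfcompare's Hunspell wrapper and its DecoratedWord lists.

import re
from collections import Counter

def singleton_suspects(text, is_known_word):
    # Words of 3+ letters, capitalization preserved (the patch notes that
    # hunspell handles capitalization nicely).
    stems = re.findall(r'[A-Za-z_-]{3,}', text)
    counts = Counter(stems)
    # Report only stems that occur exactly once AND are unknown to the dictionary.
    return sorted(w for w, n in counts.items() if n == 1 and not is_known_word(w))

# Toy run: "Cehp" is a typo seen once; "Ceph" appears twice and stays out of the list.
known = {"stores", "objects", "fast", "scales"}
print(singleton_suspects("Ceph stores objects. Cehp is fast. Ceph scales.",
                         is_known_word=lambda w: w in known))
# -> ['Cehp']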

jnweiger · Jan 28 '14 19:01

--- /suse/jw/src/github/pdfcompare/pdf_highlight.py 2014-01-07 15:28:01.000000000 +0100
+++ /usr/bin/pdfcompare 2014-01-28 19:54:19.604902143 +0100
@@ -88,6 +88,8 @@
 #                        later on. Strange.
 # 2014-01-07, V1.6.5 jw - manually merged https://github.com/jnweiger/pdfcompare/pull/4
 #                        hope, I did not break too much...
+# 2014-01-28, V1.6.6 jw - --spell now prints out a word list of non-dictionary words seen
+#                        exaclty once.
 #
 # osc in devel:languages:python python-pypdf >= 1.13+20130112
 #  need fix from https://bugs.launchpad.net/pypdf/+bug/242756
@@ -1119,7 +1121,8 @@
     each page, giving the exact coordinates of the bounding box of all occurances.
     Font metrics are used to interpolate into the line fragments found in the dom tree.
-    Keys and values from ext['e'] are merged into the DecoratedWord output for pattern matches and spell check findings.
+    Keys and values from ext['e'] are merged into the DecoratedWord output for pattern
+    matches and spell check findings.
     If re_pattern is None, then wordlist is used instead.
     Keys and values from ext['a'], ext['d'], or ext['c'] respectively are merged into
     the DecoratedWord output for added, deleted, or changed texts (respectivly).
@@ -1218,7 +1221,7 @@
     print("SequenceMatcher done")   # this means nothing... s.get_opcodes() takes ages!

     def opcodes_find_moved(iter_list):
-      """ adds a 6th element to the yielded tuple, which holds refernces between
+      """ adds a 6th element to the yielded tuple, which holds references between
          'delete's and 'insert's. Example hint = { 'ref': [ (start_idx, similarity), ...] }
          A similarity of 1.0 means "identical"; a similarity of 0.0 has nothing in common.
          The elements in the ref list are sorted by decreasing similarity.
@@ -1392,19 +1395,33 @@
   if spell_check:
     h = Hunspell(dicts=None)
     word_set = set()
+    singularity_word_set = set()
     for word in wl_new:
       m = re.search('([a-z_-]{3,})', word[0], re.I)
       if m:
         # preserve capitalization. hunspell handles that nicely.
         stem = m.group(1)
         word_set.add(stem)
-        if not 's' in word[3]: word[3]['s'] = {}
+        if not 's' in word[3]:
+          word[3]['s'] = {}
+          singularity_word_set.add(stem)
+        elif stem in singularity_word_set:
+          singularity_word_set.remove(stem)
         word[3]['s'][word[2]] = stem

     print("%d words to check" % len(word_set))
     bad_word_dict = h.check_words(word_set)
     print("checked: %d bad" % len(bad_word_dict))
     if debug > 1: pprint(['bad_word_dict: ', bad_word_dict])
+
+    bad_singularity_word_set = set()
+    for word in bad_word_dict:
+      if word in singularity_word_set:
+        bad_singularity_word_set.add(word)
+    # not really useful. Too many false positives due to not recognized hyphenations.
+    if debug:
+      pprint([len(bad_singularity_word_set), 'bad_singularities: ', bad_singularity_word_set])

     idx = 0
     for word in wl_new:
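
The hyphenation false positives mentioned in the patch come from word fragments at line breaks, which look like rare words and inflate the singleton list. A rough sketch of one possible mitigation, not part of the patch: the dehyphenate helper below is hypothetical and operates on raw text lines, whereas a real fix would have to happen in pdfcompare's own text extraction from the dom tree.

import re

def dehyphenate(lines):
    # Join words that were split across a line break, e.g. "frag-" + "ments" -> "fragments".
    text = "\n".join(lines)
    return re.sub(r'(\w)-\n\s*(\w)', r'\1\2', text)

print(dehyphenate(["Font metrics are used to interpolate into the line frag-",
                   "ments found in the dom tree."]))
# -> Font metrics are used to interpolate into the line fragments found in the dom tree.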

jnweiger · Jan 28 '14 19:01