Support alphanumeric and hyphenated words

Open nuno-agostinho opened this issue 4 years ago • 2 comments

I am using the following words in my package:

  • RNA-seq
  • 1st
  • 2nd
  • EIF4G1

After adding these words to inst/WORDLIST and running spelling::spell_check_package(), the function still reports seq, st, nd and EIF as misspelled.

For now, my WORDLIST lists the fragments seq, st, nd and EIF so the spell checker is not triggered, but I would prefer to list the full words. Thanks.

nuno-agostinho avatar Feb 03 '20 18:02 nuno-agostinho

I have the same issue; I ran into it with ordinal indicators. The problem appears to be in the hunspell parser:

hunspell::hunspell_parse(c("1st", "RNA-seq", "EIF4G1"))
#> [[1]]
#> [1] "st"
#> 
#> [[2]]
#> [1] "RNA" "seq"
#> 
#> [[3]]
#> [1] "EIF" "G"

Created on 2021-02-06 by the reprex package (v0.3.0)
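
The splitting shown above can be approximated in base R, which makes it easy to see which fragments currently have to be whitelisted. This is only a sketch: it assumes the parser keeps runs of letters and drops digits and hyphens, which matches the three results above but is not a description of hunspell's actual tokenizer. The helper name `letter_fragments` is made up for illustration.

```r
# Approximate the splitting shown above with base R only.
# Assumption: the parser keeps runs of letters and drops digits and hyphens.
letter_fragments <- function(words) {
  regmatches(words, gregexpr("[[:alpha:]]+", words))
}

words <- c("1st", "RNA-seq", "EIF4G1")
fragments <- letter_fragments(words)

# Fragments that reach the spell checker instead of the full words
all_fragments <- unique(unlist(fragments))
all_fragments
```

Under that assumption, the fragments are st, RNA, seq, EIF and G, which is why listing the full words in WORDLIST has no effect.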

jmbarbone avatar Feb 06 '21 22:02 jmbarbone

Implementing a pre-filter right before the parse step here could work:

https://github.com/ropensci/spelling/blob/a2b5f29856b6a067e33d45e29ae3aa4b88ed6176/R/check-files.R#L118-L123

It feels more like a quick fix, because it splits each line with strsplit() and then paste()s the pieces back together before the text is sent to the actual parsing function.

ignore_words <- c("1st", "RNA-seq", "EIF4G1")

lines <- c(
  "This is the 1st line.  It has first written in it.",
  "The second has RNA-seq inside. But does not use RNAseq -- without the '-'",
  "EIF4G1 but not EIF4G1fdsadf is used",
  "This line's words are fine!"
)

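pre_filter_plain <- function(lines, ignore = character()) {
  # Split each line on any character that is not a hyphen, alphanumeric, or
  # punctuation (in practice, whitespace), so "RNA-seq" stays in one piece
  word_list <- strsplit(lines, "([^-[:alnum:][:punct:]])")

  vapply(
    word_list,
    function(i) {
      # Drop exact matches of ignored words, then rebuild the line
      paste(i[!i %in% ignore], collapse = " ")
    },
    character(1)
  )
}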

pre_filter_plain(lines, ignore_words)
#> [1] "This is the line.  It has first written in it."                   
#> [2] "The second has inside. But does not use RNAseq -- without the '-'"
#> [3] "but not EIF4G1fdsadf is used"                                     
#> [4] "This line's words are fine!"

Created on 2021-02-06 by the reprex package (v0.3.0)
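
One limitation of the split-and-paste approach is that a word only matches when it stands alone between spaces, so "RNA-seq." at the end of a sentence would be kept. A possible variant, sketched here with a hypothetical `pre_filter_regex()` that is not part of spelling or hunspell, removes the ignored words with a single regular expression instead. It assumes none of the ignored words contains the sequence `\E`.

```r
# Hypothetical variant of the pre-filter: remove ignored words with a regex
# instead of splitting and re-pasting, so trailing punctuation does not matter.
pre_filter_regex <- function(lines, ignore = character()) {
  if (length(ignore) == 0L) {
    return(lines)
  }
  # \Q...\E makes PCRE treat each word literally (assumes no word contains "\E")
  literal <- paste0("\\Q", ignore, "\\E", collapse = "|")
  # Lookarounds require that the match is not glued to letters, digits or "-"
  pattern <- paste0("(?<![-[:alnum:]])(?:", literal, ")(?![-[:alnum:]])")
  gsub(pattern, "", lines, perl = TRUE)
}

filtered <- pre_filter_regex(
  c("The 1st sample used RNA-seq.", "EIF4G1 but not EIF4G1fdsadf or RNAseq"),
  ignore = c("1st", "RNA-seq", "EIF4G1")
)
filtered
```

The surrounding whitespace is left untouched, so removed words leave extra spaces behind; that should not matter if the result is only passed on to the parser.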

jmbarbone avatar Feb 07 '21 00:02 jmbarbone