quanteda

Tidy up developer functions

Open kbenoit opened this issue 5 years ago • 3 comments

These have changed a lot recently, and I want to get a clear picture of these functions and of how we package and document them together. I'm opening this issue to flag it and will keep developing the notes here.

Functions affected:

  • object2id()
  • object2fixed()
  • pattern2id()
  • pattern2fixed()
  • index_types()
  • index() (aka locate())

kbenoit • Feb 26 '21 11:02

Should also address #2062

kbenoit • Mar 03 '21 09:03

So what I meant in this PR is that the object2*() and pattern2*() functions in particular are important building blocks of our functionality that could be useful to other developers (or to our future selves or other quanteda developers). They work in similar ways, but it's not clear which should be used when.

NOTE: We don't necessarily need these tidied up before the v3 release, since they are internal, but I think that tidying them up could help meet the goal you expressed of promoting core functions for developers, for instance if we create a developer vignette and talk about some of our internal functions and structures.

> pattern <- list(c("^a$", "^b"), c("c"), c("d"))
> types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
> pattern2fixed(pattern, types, "regex", case_insensitive = TRUE)
[[1]]
[1] "A" "B"

[[2]]
[1] "A"  "BB"

[[3]]
[1] "A"   "BBB"

[[4]]
[1] "C"

[[5]]
[1] "CC"

> object2fixed(pattern, types, "regex", case_insensitive = TRUE)
$`^a$ ^b`
[1] "A" "B"

$`^a$ ^b`
[1] "A"  "BB"

$`^a$ ^b`
[1] "A"   "BBB"

$c
[1] "C"

$c
[1] "CC"

I wonder why we do not consolidate them into pattern2*(), since the input objects are also valid inputs listed in ?pattern.

Also, the *2id functions behave like an lapply() over match(). The return value documented in ?match:

match: An integer vector giving the position in table of the first match if there is a match, otherwise nomatch.

Would it make more sense to describe the functions this way, and potentially to name them to reflect the similarity with match()?
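To make the comparison concrete, here is a minimal base R sketch of the "lapply() over match()" idea. It calls no quanteda functions. The `patterns` list is a hypothetical input whose elements are already exact type strings, so no glob or regex handling or case folding is involved. The analogy is loose: as the output above shows, the real functions can expand one pattern into several results, which match() never does.

```r
# Base R only; no quanteda functions are called here.
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

# Hypothetical patterns, already written as exact type strings
# (an assumption made purely for illustration).
patterns <- list(c("A", "B"), "C", "D")

# For each pattern, look up the position of every element in `types`.
# match() returns the index of the first exact match, or NA if there is none.
ids <- lapply(patterns, match, table = types)

print(ids)
```

Here "D" has no counterpart in `types`, so its entry is NA, which is the `nomatch` default described in ?match.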

kbenoit • Mar 20 '21 13:03

object2*() takes various objects, such as dictionaries and collocations. It depends on pattern2*().

*2id is the underlying function that returns positions in the type vector, so pattern2id is the mother of all the functions.
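A rough sketch of why positions in the type vector are enough to recover the fixed form, using base R only. The `ids` list is a hypothetical, hand-written id-style result chosen to correspond to the pattern2fixed() output shown earlier in the thread; it is an assumption, not output produced by pattern2id().

```r
# Base R only; `ids` is hand-written for illustration, not produced by quanteda.
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

# Hypothetical id-style result: integer positions in `types`.
ids <- list(c(1L, 3L), c(1L, 4L), c(1L, 5L), 6L, 7L)

# Indexing `types` by each set of positions gives the fixed (character) form.
fixed <- lapply(ids, function(i) types[i])

print(fixed)
```

Under that assumption, a *2fixed result is just a *2id result indexed back into `types`, which fits the description of *2id as the underlying function.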

koheiw • Mar 20 '21 14:03