feature request: fnobs() without altering other table columns

Open D3SL opened this issue 1 year ago • 1 comments

One of my workflows relies on counting unique values in one column grouped by another. Using data.table's .N I can get that count as a column without altering anything else. In collapse afaik this isn't possible, the closest approximate I've found is joining a table to a copy of itself.

In terms of performance the overhead seems minimal, only 9k microseconds at 100k rows of test data. The main issue is the human component. It's not a trivial amount of extra typing and the fselect and join are both pain points where errors could creep in.

Unit: microseconds
                  expr   min    lq   mean median    uq    max neval
                .N 10   572   580    626    588   600   2155   100
        fnobs join 10   387   396    475    409   501    880   100
      fnobs by itself    77    80    104     83    94    380   100
              .N 100k 63568 75143 104600  83375 93415 219051   100
      fnobs join 100k 35900 36406  40976  38768 45227  53712   100
 fnobs by itself 100k 29286 29478  32771  29755 32207 142688   100

code to produce benchmark:

library(data.table)
library(collapse)

test<-data.table(
  idcol=c(rep("foo",3),rep("bar",3),rep("baz",3),"lonesome george"),
  filler=letters[1:10],
  target=c(1,2,3,1,1,2,1,1,1,1)
)
test2<-copy(test)
test3<-copy(test[rep(1:fnrow(test),100000)])

benchresults<-microbenchmark::microbenchmark(control=list(order="block"),
  ".N 10"={
    test[,.(count=.N,filler),by=c("idcol","target")]
  },
  "fnobs join 10"={
    join(
      test2,
      test2|>fgroup_by(idcol,target)|>fnobs()|>fselect(idcol,target,count=filler)
    )
  },
    "fnobs by itself"={
    test2|>fgroup_by(idcol,target)|>fnobs()|>fselect(idcol,target,count=filler)
  },
  ".N 100k"={
    test3[,.(count=.N,filler),by=c("idcol","target")]
  },
  "fnobs join 100k"={
    join(
      test3,
      test3|>fgroup_by(idcol,target)|>fnobs()|>fselect(idcol,target,count=filler)
    )
  },
  "fnobs by itself 100k"={
    test3|>fgroup_by(idcol,target)|>fnobs()|>fselect(idcol,target,count=filler)
  }
)

Feb 25 '24 11:02 D3SL

Not sure I fully understand it, but you have several options:

# Count observation by idcol and target, add to dataset using add = TRUE
test |> fcount(idcol, target, name = "count", add = TRUE)
# Same thing
test |> fmutate(count = GRPN(list(idcol, target)))
# Now more what I thought the question was about: counting unique values in one column grouped by another
test |> fmutate(count = fndistinct(target, idcol, "fill"))

Feb 25 '24 21:02 SebKrantz