collapse
collapse copied to clipboard
feature request: fnobs() without altering other table columns
One of my workflows relies on counting unique values in one column grouped by another. Using data.table's .N I can get that count as a column without altering anything else. In collapse afaik this isn't possible, the closest approximate I've found is joining a table to a copy of itself.
In terms of performance the overhead seems minimal, only 9k microseconds at 100k rows of test data. The main issue is the human component. It's not a trivial amount of extra typing and the fselect and join are both pain points where errors could creep in.
Unit: microseconds
expr min lq mean median uq max neval
.N 10 572 580 626 588 600 2155 100
fnobs join 10 387 396 475 409 501 880 100
fnobs by itself 77 80 104 83 94 380 100
.N 100k 63568 75143 104600 83375 93415 219051 100
fnobs join 100k 35900 36406 40976 38768 45227 53712 100
fnobs by itself 100k 29286 29478 32771 29755 32207 142688 100
code to produce benchmark:
library(data.table)
library(collapse)
test<-data.table(
idcol=c(rep("foo",3),rep("bar",3),rep("baz",3),"lonesome george"),
filler=letters[1:10],
target=c(1,2,3,1,1,2,1,1,1,1)
)
test2<-copy(test)
test3<-copy(test[rep(1:fnrow(test),100000)])
benchresults<-microbenchmark::microbenchmark(control=list(order="block"),
".N 10"={
test[,.(count=.N,filler),by=c("idcol","target")]
},
"fnobs join 10"={
join(
test2,
test2|>fgroup_by(idcol,target)|>fnobs()|>fselect(idcol,target,count=filler)
)
},
"fnobs by itself"={
test2|>fgroup_by(idcol,target)|>fnobs()|>fselect(idcol,target,count=filler)
},
".N 100k"={
test3[,.(count=.N,filler),by=c("idcol","target")]
},
"fnobs join 100k"={
join(
test3,
test3|>fgroup_by(idcol,target)|>fnobs()|>fselect(idcol,target,count=filler)
)
},
"fnobs by itself 100k"={
test3|>fgroup_by(idcol,target)|>fnobs()|>fselect(idcol,target,count=filler)
}
)
Not sure I fully understand it, but you have several options:
# Count observation by idcol and target, add to dataset using add = TRUE
test |> fcount(idcol, target, name = "count", add = TRUE)
# Same thing
test |> fmutate(count = GRPN(list(idcol, target)))
# Now more what I thought the question was about: counting unique values in one column grouped by another
test |> fmutate(count = fndistinct(target, idcol, "fill"))