arquero
arquero copied to clipboard
feature request: can derive accept aq.all, aq.not, etc.:
there might not be a possible solution, but I would like a derive that would apply to any range of columns specified with aq.all(), aq.range(), aq.not, etc.
table
.derive(aq.all(), aq.escapeAny(some operation applied to all columns)
Other useful column selectors might be: aq.allNumeric(), aq.allString(), aq.allDate(), etc., this is super useful in R.
I don't know if derive is the right verb for this, but something akin to this could be possible. Can you give one or more specific examples of the kinds of calculations you'd like to do here. That might help me think through this more carefully.
ex 1) given some table w dozens of columns a) select only the numeric columns b) truncate all of them in one shot to integers. (e.g., Math.floor)
ex 2) select all date columns, and reformat as a string hh:mm
ex 3) select a subset of columns (e.g., w/ an array) and apply a bulk transformation, etc.
this is very common and indeed very convenient when we are wrangling tables w hundreds of columns.
dan
On Wed, Jun 2, 2021 at 7:15 PM Jeffrey Heer @.***> wrote:
I don't know if derive is the right verb for this, but something akin to this could be possible. Can you give one or more specific examples of the kinds of calculations you'd like to do here. That might help me think through this more carefully.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/uwdata/arquero/issues/200#issuecomment-853418245, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD32VFRL54P2IXXCU76S6X3TQ2UPFANCNFSM457SIZHA .
Jeff with R's dplyr, the column selection method (which applied not only do mutate = derive, rename, filter, etc.) was based on variants of the transformation function, for example:
select_if(...) // rename_if, mutate_if, etc. select_all(...) select_at(...)
more recently, this was superseded by
select, rename, mutate, etc. keeping the same names, but with the "across" operator within as an argument to select the desired column subset. I prefer the former method.
a useful reference page: https://dplyr.tidyverse.org/reference/select_all.html
On Wed, Jun 2, 2021 at 9:27 PM Dan Reznik @.***> wrote:
ex 1) given some table w dozens of columns a) select only the numeric columns b) truncate all of them in one shot to integers. (e.g., Math.floor)
ex 2) select all date columns, and reformat as a string hh:mm
ex 3) select a subset of columns (e.g., w/ an array) and apply a bulk transformation, etc.
this is very common and indeed very convenient when we are wrangling tables w hundreds of columns.
dan
On Wed, Jun 2, 2021 at 7:15 PM Jeffrey Heer @.***> wrote:
I don't know if derive is the right verb for this, but something akin to this could be possible. Can you give one or more specific examples of the kinds of calculations you'd like to do here. That might help me think through this more carefully.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/uwdata/arquero/issues/200#issuecomment-853418245, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD32VFRL54P2IXXCU76S6X3TQ2UPFANCNFSM457SIZHA .
I too often need something like this. In R I tend to do it with across
, which I believe is a newer version of the mutat_if
and mutate_all
patterns and is a bit more flexible. Heres an example:
iris %>%
as_tibble() %>%
mutate(across(contains("length"), round))
iris %>%
group_by(Species) %>%
summarise(across(starts_with("Sepal"), list(mean = mean, sd = sd)))
I could imagine something like the following being an Arquero analog of that pattern:
aq.from(iris)
.derive({
(r,c) => op.round(r[c])) // makes a column named c for each value of c
},
{across: aq.matches("length")}
)
aq.from(iris)
.rollup({
mean: (r.c) => op.mean(r[c]), // makes a column named c_mean for each value of c
sd: (r.c) => op.stdev(r[c]) // makes a column named c_sd for each value of c
},
{across: aq.startswith("Sepal")}
)
Where c
is a column name from those provided to across
and there might also be an option for sep
that defaults to _
.
Found this thread trying to find the solution for my use case.
I have dynamic column names, generated from data by .pivot
.
And now I want to group by total and summarise dynamic columns:
table
.groupby(/* all_except_the_dynamic_columns */)
.rollup({
// cannot use this, because column names and number are dynamic
dynamic_col1: op.sum('dynamic_col1'),
...
dynamic_colN: op.sum('dynamic_colN')
})
The operation that will make the example above look like this:
Maybe I am missing something and there is a way to do this already?
For now I managed to get the desired effect by building rollup object dynamically:
const dynamicColumns = table
.columnNames()
.filter((c) => ![..., "totalHits"].includes(c));
table
.groupby(..., "hitsTotal")
.rollup(dynamicColumns.reduce((rollup, col) => ({ ...rollup, [col]: op.sum(col) }), {}));