arquero icon indicating copy to clipboard operation
arquero copied to clipboard

feature request: can derive accept aq.all, aq.not, etc.:

Open dan-reznik opened this issue 3 years ago • 6 comments

there might not be a possible solution, but I would like a derive that would apply to any range of columns specified with aq.all(), aq.range(), aq.not, etc.

table
   .derive(aq.all(), aq.escapeAny(some operation applied to all columns)

Other useful column selectors might be: aq.allNumeric(), aq.allString(), aq.allDate(), etc., this is super useful in R.

dan-reznik avatar Jun 02 '21 22:06 dan-reznik

I don't know if derive is the right verb for this, but something akin to this could be possible. Can you give one or more specific examples of the kinds of calculations you'd like to do here. That might help me think through this more carefully.

jheer avatar Jun 02 '21 22:06 jheer

ex 1) given some table w dozens of columns a) select only the numeric columns b) truncate all of them in one shot to integers. (e.g., Math.floor)

ex 2) select all date columns, and reformat as a string hh:mm

ex 3) select a subset of columns (e.g., w/ an array) and apply a bulk transformation, etc.

this is very common and indeed very convenient when we are wrangling tables w hundreds of columns.

dan

On Wed, Jun 2, 2021 at 7:15 PM Jeffrey Heer @.***> wrote:

I don't know if derive is the right verb for this, but something akin to this could be possible. Can you give one or more specific examples of the kinds of calculations you'd like to do here. That might help me think through this more carefully.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/uwdata/arquero/issues/200#issuecomment-853418245, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD32VFRL54P2IXXCU76S6X3TQ2UPFANCNFSM457SIZHA .

dan-reznik avatar Jun 03 '21 00:06 dan-reznik

Jeff with R's dplyr, the column selection method (which applied not only do mutate = derive, rename, filter, etc.) was based on variants of the transformation function, for example:

select_if(...) // rename_if, mutate_if, etc. select_all(...) select_at(...)


more recently, this was superseded by

select, rename, mutate, etc. keeping the same names, but with the "across" operator within as an argument to select the desired column subset. I prefer the former method.


a useful reference page: https://dplyr.tidyverse.org/reference/select_all.html

On Wed, Jun 2, 2021 at 9:27 PM Dan Reznik @.***> wrote:

ex 1) given some table w dozens of columns a) select only the numeric columns b) truncate all of them in one shot to integers. (e.g., Math.floor)

ex 2) select all date columns, and reformat as a string hh:mm

ex 3) select a subset of columns (e.g., w/ an array) and apply a bulk transformation, etc.

this is very common and indeed very convenient when we are wrangling tables w hundreds of columns.

dan

On Wed, Jun 2, 2021 at 7:15 PM Jeffrey Heer @.***> wrote:

I don't know if derive is the right verb for this, but something akin to this could be possible. Can you give one or more specific examples of the kinds of calculations you'd like to do here. That might help me think through this more carefully.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/uwdata/arquero/issues/200#issuecomment-853418245, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD32VFRL54P2IXXCU76S6X3TQ2UPFANCNFSM457SIZHA .

dan-reznik avatar Jun 03 '21 11:06 dan-reznik

I too often need something like this. In R I tend to do it with across, which I believe is a newer version of the mutat_if and mutate_all patterns and is a bit more flexible. Heres an example:

iris %>%
  as_tibble() %>%
  mutate(across(contains("length"), round))

iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Sepal"), list(mean = mean, sd = sd)))

I could imagine something like the following being an Arquero analog of that pattern:

aq.from(iris)
  .derive({
      (r,c) => op.round(r[c])) // makes a column named c for each value of c
    },
   {across: aq.matches("length")}
  )

aq.from(iris)
  .rollup({
      mean: (r.c) => op.mean(r[c]), // makes a column named c_mean for each value of c
      sd: (r.c) => op.stdev(r[c]) // makes a column named c_sd for each value of c
    },
    {across: aq.startswith("Sepal")}
  )

Where c is a column name from those provided to across and there might also be an option for sep that defaults to _.

markwhiting avatar Jul 07 '21 03:07 markwhiting

Found this thread trying to find the solution for my use case. image

I have dynamic column names, generated from data by .pivot. And now I want to group by total and summarise dynamic columns:

table
  .groupby(/* all_except_the_dynamic_columns */)
  .rollup({
    // cannot use this, because column names and number are dynamic
    dynamic_col1: op.sum('dynamic_col1'), 
    ...
    dynamic_colN: op.sum('dynamic_colN')
  })

The operation that will make the example above look like this: image

Maybe I am missing something and there is a way to do this already?

oluckyman avatar Sep 23 '22 17:09 oluckyman

For now I managed to get the desired effect by building rollup object dynamically:

const dynamicColumns = table
    .columnNames()
    .filter((c) => ![..., "totalHits"].includes(c));

table
    .groupby(..., "hitsTotal")
    .rollup(dynamicColumns.reduce((rollup, col) => ({ ...rollup, [col]: op.sum(col) }), {}));

oluckyman avatar Sep 24 '22 07:09 oluckyman