datawizard icon indicating copy to clipboard operation
datawizard copied to clipboard

Improve docs for data_to_wide

Open strengejacke opened this issue 1 year ago • 5 comments

Background: https://x.com/stephenjwild/status/1792294171924967920 And I recently wanted to use data_to_wide() and couldn't get it work at first, took some time to figure out the logic behind reshaping to wide again. Especially, the documentation of id_cols was confusing.

strengejacke avatar May 20 '24 09:05 strengejacke

but to me those pivot_*() functions made pivoting/unpivoting so much easier to understand

True, but I think by even makes it even clearer. by refers across our easystats-packages to a variable/column that indicates "grouping". Reshaping to wide format commonly, but not in general, identifies "repeated measurements" by an "ID", or group. And id_cols suggests plural, although often only one column is selected (that was also one of the criticism on Twitter). Thus, id_cols is not even the best choice from a grammar perspective, too ;-)

strengejacke avatar May 21 '24 16:05 strengejacke

By @strengejacke in a review comment:

I personally prefer by, both because it's in line with renaming the other arguments across functions into by. Especially, since by refers to a "grouping" variable, and imho, by makes clearer that this specified variables is used for gathering those rows ("groups") that are spread as new columns.

However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?

etiennebacher avatar May 24 '24 12:05 etiennebacher

However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?

I think we should deprecate and then remove one of the two. Those pivot functions already have a lot of args so I suggest we avoid adding bloat with aliases.

Thus, id_cols is not even the best choice from a grammar perspective, too ;-)

I don't think there is a best choice for this, one could use id_col but it would suggest this must be a single column while it's not always the case. Just for comparison with other packages:

  • base::reshape(): idvar
  • data.table::melt(): id.vars
  • pandas and polars: index

Considering a by argument would be quite unique (doesn't mean that other packages always make things right, but it's quite indicative).

Curious about what others think about renaming id_cols to by @easystats/core-team

etiennebacher avatar May 24 '24 13:05 etiennebacher

We slice and reshape by the levels of that variable - it's still my personal favorite, and we use this argument name now aggressively consistent in our packages. I don't mind deviating from other packages, especially since users are free to use those other packages if they prefer that syntax.

I just realize that when you teach R to beginners, it's really good if you can repeat yourself: "use select for..." or "use by for...". Easier to understand and remember for newbies

strengejacke avatar May 24 '24 16:05 strengejacke

bump @DominiqueMakowski @IndrajeetPatil @bwiernik @rempsyc @mattansb

strengejacke avatar May 26 '24 12:05 strengejacke

I think I agree with @etiennebacher; id_cols' purpose was to eventually specify an index, but i don't think it is conceptually different from what by is expected to do

Sorry @strengejacke 😅 - but I'm not convinced it should be replaced or made an alias

DominiqueMakowski avatar May 30 '24 08:05 DominiqueMakowski

I don't quite understand the use of by as an alias for id_cols. Could you talk about the similarity there?

bwiernik avatar May 30 '24 13:05 bwiernik

but i don't think it is conceptually different from what by is expected to do

Yes, that's why I thought we could use by, because we use it in conceptually similar ways across other functions now.

But I'm ok with sticking to id_cols, though I don't see the need to use that name. The id_cols refers to a variable, whose "levels" is used for grouping/stratification. In every other instance we now use by, not here though.

I think "re-thinking" reshaping data in this way makes it rather easy to use, especially since it would be in line with our other functions.

strengejacke avatar May 30 '24 16:05 strengejacke

Btw, for data_to_long(), we have select for selecting columns, but also its alias cols for compatibility with pivot_longer()

strengejacke avatar May 30 '24 18:05 strengejacke

I think the conceptual similarity of by to id_cols is very tenuous. I suggest we drop it

bwiernik avatar May 31 '24 03:05 bwiernik

Ok, if nobody likes my suggestion (😢), let's stick to id_cols then.

strengejacke avatar May 31 '24 06:05 strengejacke

by:

exterminador-do

strengejacke avatar May 31 '24 14:05 strengejacke