datawizard
datawizard copied to clipboard
Improve docs for data_to_wide
Background: https://x.com/stephenjwild/status/1792294171924967920
And I recently wanted to use data_to_wide() and couldn't get it work at first, took some time to figure out the logic behind reshaping to wide again. Especially, the documentation of id_cols was confusing.
but to me those
pivot_*()functions made pivoting/unpivoting so much easier to understand
True, but I think by even makes it even clearer. by refers across our easystats-packages to a variable/column that indicates "grouping". Reshaping to wide format commonly, but not in general, identifies "repeated measurements" by an "ID", or group. And id_cols suggests plural, although often only one column is selected (that was also one of the criticism on Twitter). Thus, id_cols is not even the best choice from a grammar perspective, too ;-)
By @strengejacke in a review comment:
I personally prefer by, both because it's in line with renaming the other arguments across functions into by. Especially, since by refers to a "grouping" variable, and imho, by makes clearer that this specified variables is used for gathering those rows ("groups") that are spread as new columns.
However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?
However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?
I think we should deprecate and then remove one of the two. Those pivot functions already have a lot of args so I suggest we avoid adding bloat with aliases.
Thus, id_cols is not even the best choice from a grammar perspective, too ;-)
I don't think there is a best choice for this, one could use id_col but it would suggest this must be a single column while it's not always the case. Just for comparison with other packages:
base::reshape(): idvardata.table::melt(): id.varspandasandpolars: index
Considering a by argument would be quite unique (doesn't mean that other packages always make things right, but it's quite indicative).
Curious about what others think about renaming id_cols to by @easystats/core-team
We slice and reshape by the levels of that variable - it's still my personal favorite, and we use this argument name now aggressively consistent in our packages. I don't mind deviating from other packages, especially since users are free to use those other packages if they prefer that syntax.
I just realize that when you teach R to beginners, it's really good if you can repeat yourself: "use select for..." or "use by for...". Easier to understand and remember for newbies
bump @DominiqueMakowski @IndrajeetPatil @bwiernik @rempsyc @mattansb
I think I agree with @etiennebacher; id_cols' purpose was to eventually specify an index, but i don't think it is conceptually different from what by is expected to do
Sorry @strengejacke 😅 - but I'm not convinced it should be replaced or made an alias
I don't quite understand the use of by as an alias for id_cols. Could you talk about the similarity there?
but i don't think it is conceptually different from what
byis expected to do
Yes, that's why I thought we could use by, because we use it in conceptually similar ways across other functions now.
But I'm ok with sticking to id_cols, though I don't see the need to use that name. The id_cols refers to a variable, whose "levels" is used for grouping/stratification. In every other instance we now use by, not here though.
I think "re-thinking" reshaping data in this way makes it rather easy to use, especially since it would be in line with our other functions.
Btw, for data_to_long(), we have select for selecting columns, but also its alias cols for compatibility with pivot_longer()
I think the conceptual similarity of by to id_cols is very tenuous. I suggest we drop it
Ok, if nobody likes my suggestion (😢), let's stick to id_cols then.
by: