add a `data_arrange()`?

Open etiennebacher opened this issue 2 years ago • 29 comments

If I'm correct, there's nothing to do this in datawizard so far. I think it would be useful, what do you think?

I made a small function but for now it cannot arrange in decreasing order:

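data_arrange <- function(data, ...) {

  # column names to sort by, passed as character strings
  el <- c(...)

  # stop early if any requested column is missing from the data
  dont_exist <- el[which(!el %in% names(data))]
  if (length(dont_exist) > 0) {
    stop(insight::format_message(
      paste0(
        "The following column(s) don't exist in the dataset: ",
        datawizard::text_concatenate(dont_exist), "."
      )
    ), call. = FALSE)
  }

  # sort rows in ascending order; with several columns, each column is
  # passed to order() as a separate sort key
  if (length(el) == 1) {
    data[order(data[[el]]), ]
  } else {
    data[do.call(order, data[, el]), ]
  }

}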


data_arrange(head(mtcars), "cyl", "drat")
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

data_arrange(head(iris), "Sepal.Length")
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 4          4.6         3.1          1.5         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 1          5.1         3.5          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

data_arrange(head(iris), "foo")
#> Error: The following column(s) don't exist in the dataset: foo.

Created on 2022-07-05 by the reprex package (v2.0.1)

etiennebacher avatar Jul 05 '22 15:07 etiennebacher

Love it! Another equivalent of a dplyr function, which, I am sure, will also be helpful in easystats packages.

Feel free to make a PR.

IndrajeetPatil avatar Jul 05 '22 16:07 IndrajeetPatil

Something I also often use is summarise(). However, I think we had a discussion about "copying" more functions from dplyr/tidyr to datawizard, and iirc, we didn't want to implement new functions if these are not urgently needed (e.g. because we often use it internally in our packages).

Tagging @bwiernik and @DominiqueMakowski, I think at least one of them was involved in that discussion.

General comment: I would follow our API and use select to select variables to arrange.

strengejacke avatar Jul 05 '22 17:07 strengejacke

(I personally wouldn't mind adding some more functions, maybe the consideration would be whether adding more functions will detract from datawizard's core functions?)

strengejacke avatar Jul 05 '22 17:07 strengejacke

I noticed data_arrange() was missing when working on reshape_longer() because I needed to arrange a dataframe with several columns and the base R code is a bit inelegant. However, I'm not sure this is needed in other functions or in the other packages

Edit: here's the previous discussion about this: https://github.com/easystats/datawizard/issues/130
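
For context, a minimal sketch of the base R idiom in question, using `head(mtcars)` as in the examples above. The object names (`df`, `cols`, `by_cols`, `mixed_neg`, `mixed_radix`) are illustrative only. `unname()` is an added precaution so that a column called, say, `decreasing` is not mistaken for an argument of `order()`.

```r
df <- head(mtcars)

# Several columns chosen programmatically: do.call() hands each column
# to order() as a separate sort key (the first key has highest priority).
cols <- c("gear", "carb")
by_cols <- df[do.call(order, unname(as.list(df[cols]))), ]

# Mixed directions need either a negated key (numeric columns only) ...
mixed_neg <- df[order(-df$gear, df$carb), ]

# ... or a per-key `decreasing` vector, which requires method = "radix".
mixed_radix <- df[order(df$gear, df$carb,
                        decreasing = c(TRUE, FALSE), method = "radix"), ]

rownames(by_cols)
rownames(mixed_radix)
```

Both mixed-direction variants give the same row order here; the negation trick stops working as soon as one of the sort columns is a character vector or a factor.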

etiennebacher avatar Jul 05 '22 17:07 etiennebacher

That's generally my perspective @strengejacke. Unless the function (1) solves a problem that tidyverse doesn't (most of our stat transformations) or (2) provides functionality we need internally that would otherwise require base R code that is very hard to write/read or fragile, we should generally avoid just copying tidyverse functionality. That's, e.g., the goal of {poorman}. Not an absolute by any means, just a general perspective that I think we should follow unless there's a good reason otherwise.

bwiernik avatar Jul 05 '22 17:07 bwiernik

Agree with Brenton, and I'll add a maintainability argument: data_arrange() is probably fairly straightforward to implement cleanly and maintain, whereas something like summarize() is likely 🤯 (especially if we consider all the variants summarize_at, summarize_if etc.)

DominiqueMakowski avatar Jul 06 '22 00:07 DominiqueMakowski

Ok, I suppose that if no one implemented data_arrange() so far, it's because it's not really needed in other easystats packages (even in datawizard, I only found one case where I needed it). So let's close it for now and if it is needed later then maybe we can start from this code

etiennebacher avatar Jul 06 '22 06:07 etiennebacher

No no, I would like to see a data_arrange() in datawizard!

DominiqueMakowski avatar Jul 06 '22 06:07 DominiqueMakowski

Two things then:

  • Use select to select variables for arranging
  • Maybe, like in data_rename(), add a safe argument so the function doesn't error if not wanted.

strengejacke avatar Jul 06 '22 06:07 strengejacke

If we use select it's harder to specify which variable should be increasing or decreasing. Also in my use cases I never use the select helpers in arrange(), I just need to order my data with 2-3 vars max.

I slightly modified the function above to allow for "-" in front of the variable name (it won't work correctly if the variable name already starts with "-" but that should be quite rare):

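data_arrange <- function(data, ..., safe = TRUE) {

  # column names to sort by; a leading "-" requests descending order
  el <- c(...)

  # remember which columns are descending, then strip the "-" prefix
  desc <- el[startsWith(el, "-")]
  desc <- gsub("^-", "", desc)
  el <- gsub("^-", "", el)

  # with safe = TRUE, stop if any requested column is missing
  dont_exist <- el[which(!el %in% names(data))]
  if (length(dont_exist) > 0 && safe) {
    stop(insight::format_message(
      paste0(
        "The following column(s) don't exist in the dataset: ",
        datawizard::text_concatenate(dont_exist), "."
      )
    ), call. = FALSE)
  }

  # work on a copy so the returned data keeps its original values
  out <- data

  # negate the numeric sort key of descending columns;
  # xtfrm() also works for factors and character vectors
  if (length(desc) > 0) {
    for (i in desc) {
      out[[i]] <- -xtfrm(out[[i]])
    }
  }

  # order by the (possibly negated) keys, but subset the original data
  if (length(el) == 1) {
    data[order(out[[el]]), ]
  } else {
    data[do.call(order, out[, el]), ]
  }
}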

Examples:

library(datawizard)

# for comparison
head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

data_arrange(head(mtcars), "carb")
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
data_arrange(head(mtcars), "gear", "carb")
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
data_arrange(head(mtcars), "-carb")
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
data_arrange(head(mtcars), "-gear", "carb")
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
data_arrange(head(iris), "foo")
#> Error: The following column(s) don't exist in the dataset: foo.

Created on 2022-07-06 by the reprex package (v2.0.1)
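
As a side note on why the function negates `xtfrm()` rather than the column itself: a minimal sketch, with made-up vectors `x` and `f`, showing that `xtfrm()` returns a numeric key with the same ordering as its input, so the negated key also works for character vectors and factors.

```r
# Character vector: `-x` would be an error, but the xtfrm() key is numeric
x <- c("b", "a", "c")
x_desc <- x[order(-xtfrm(x))]

# Factor: the key follows the level order, not alphabetical order
f <- factor(c("low", "high", "mid"), levels = c("low", "mid", "high"))
f_desc <- as.character(f[order(-xtfrm(f))])

x_desc
#> [1] "c" "b" "a"
f_desc
#> [1] "high" "mid"  "low"
```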

etiennebacher avatar Jul 06 '22 10:07 etiennebacher

el has to be updated when safe = FALSE and some variables do not exist:

data_arrange(head(mtcars), "blub", safe = FALSE)

I see your point with ascending/descending order per variable, but I wonder if we can have an argument, like sort_descending, which can be a logical and then applies to all variables in select, or a character vector for certain variables. All others are sorted in ascending order by default.

So instead of

data_arrange(head(mtcars), "-gear", "carb")

we'd have

data_arrange(head(mtcars), select = c("gear", "carb"), sort_descending = "gear")

@DominiqueMakowski @IndrajeetPatil any thoughts? I'm mainly raising this point for internal consistency with our API/function design.
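
To make the proposal concrete, a minimal sketch of how such an argument could be normalised into one flag per selected variable. `resolve_descending()` is a hypothetical helper written for illustration, not an existing datawizard function, and it assumes `select` has already been resolved to a character vector of column names.

```r
# Hypothetical helper: turn `sort_descending` into a named logical vector
# with one entry per selected variable.
resolve_descending <- function(select, sort_descending = FALSE) {
  if (is.character(sort_descending)) {
    # character input: only the named variables are sorted descending
    unknown <- setdiff(sort_descending, select)
    if (length(unknown) > 0) {
      stop("Not in `select`: ", paste(unknown, collapse = ", "), call. = FALSE)
    }
    flags <- select %in% sort_descending
  } else if (is.logical(sort_descending) && length(sort_descending) == 1) {
    # single TRUE/FALSE: applies to all selected variables
    flags <- rep(sort_descending, length(select))
  } else {
    stop("`sort_descending` must be TRUE/FALSE or a character vector.", call. = FALSE)
  }
  stats::setNames(flags, select)
}

resolve_descending(c("gear", "carb"), sort_descending = "gear")
#>  gear  carb 
#>  TRUE FALSE
```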

strengejacke avatar Jul 06 '22 11:07 strengejacke

I wonder if we can have an argument, like sort_descending, which can be a logical and then applies to all variables in select, or a character vector for certain variables. All others are sorted in ascending order by default.

I like that

DominiqueMakowski avatar Jul 06 '22 11:07 DominiqueMakowski

el has to be updated when safe = FALSE and some variables do not exist:

Right, I updated the code.
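
A minimal sketch of one way to handle this, not necessarily the code that ended up in the PR. `arrange_sketch()` is a hypothetical stand-in that uses only base R (plain `stop()` instead of the insight/datawizard helpers) and keeps the meaning of `safe` from the function above: `safe = TRUE` errors on unknown columns, `safe = FALSE` drops them.

```r
# Hypothetical stand-in for data_arrange(), base R only
arrange_sketch <- function(data, ..., safe = TRUE) {
  el <- c(...)
  is_desc <- startsWith(el, "-")
  el <- sub("^-", "", el)

  missing_cols <- !el %in% names(data)
  if (any(missing_cols)) {
    if (safe) {
      stop("The following column(s) don't exist in the dataset: ",
           paste(el[missing_cols], collapse = ", "), ".", call. = FALSE)
    }
    # safe = FALSE: drop unknown columns together with their direction flags
    el <- el[!missing_cols]
    is_desc <- is_desc[!missing_cols]
  }

  # nothing left to sort by: return the data unchanged
  if (length(el) == 0) {
    return(data)
  }

  # one sort key per column; negate the numeric key for descending order
  keys <- lapply(seq_along(el), function(i) {
    key <- xtfrm(data[[el[i]]])
    if (is_desc[i]) -key else key
  })
  data[do.call(order, keys), , drop = FALSE]
}

# unknown column is ignored, data comes back in its original order
arrange_sketch(head(mtcars), "blub", safe = FALSE)

# unknown column is ignored, the remaining ones are still used
arrange_sketch(head(mtcars), "-gear", "blub", "carb", safe = FALSE)
```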

So instead of

data_arrange(head(mtcars), "-gear", "carb")

we'd have

data_arrange(head(mtcars), select = c("gear", "carb"), sort_descending = "gear")

Just to be sure, if you support select helpers in select you also have to support them in sort_descending, right?

To me this looks a bit overcomplicated because you have two arguments and sometimes you need to type the variable name twice. I understand that select is better in terms of consistency across datawizard but I never use select helpers in arrange(), maybe you use them more often?

Since this function will be implemented anyway, I'm making a PR with the code I have so far and let's modify it later if needed

etiennebacher avatar Jul 06 '22 12:07 etiennebacher

If we can't have non-standard evaluation (-gear), then I would say the descending argument should be a TRUE/FALSE vector rather than repeating variable names twice

bwiernik avatar Jul 06 '22 12:07 bwiernik

I added data_arrange2() as alternative with different syntax, using select etc.:

library(datawizard)

# by default, all in ascending order
x1 <- data_arrange(head(mtcars), "gear", "carb")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb"))
identical(x1, x2)
#> [1] TRUE

# `descending` specified, all remaining (none in this ex.) ordered ascending
x1 <- data_arrange(head(mtcars), "-carb")
x2 <- data_arrange2(head(mtcars), select = "carb", descending = "carb")
identical(x1, x2)
#> [1] TRUE

# `descending` specified, all remaining ("gear") ordered ascending
x1 <- data_arrange(head(mtcars), "gear", "-carb")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb"), descending = "carb")
identical(x1, x2)
#> [1] TRUE

# `descending` specified, all remaining ("gear", "am") ordered ascending
x1 <- data_arrange(head(mtcars), "gear", "-carb", "am")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb", "am"), descending = "carb")
identical(x1, x2)
#> [1] TRUE

# `ascending` specified, all remaining ("carb", "am") ordered descending
x1 <- data_arrange(head(mtcars), "gear", "-carb", "-am")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb", "am"), ascending = "gear")
identical(x1, x2)
#> [1] TRUE

# `ascending` empty, sorting all in descending order
x1 <- data_arrange(head(mtcars), "-gear", "-carb", "-am")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb", "am"), ascending = "")
identical(x1, x2)
#> [1] TRUE

Created on 2022-07-06 by the reprex package (v2.0.1)

strengejacke avatar Jul 06 '22 12:07 strengejacke

If we can't have non-standard evaluation (-gear), then I would say the descending argument should be a TRUE/FALSE vector rather than repeating variable names twice

An example would be

library(datawizard)
x1 <- data_arrange(head(mtcars), c(setdiff(colnames(mtcars), "cyl"), "-carb"))
x2 <- data_arrange2(head(mtcars), -cyl, descending = carb)
identical(x1, x2)
#> [1] TRUE

Created on 2022-07-06 by the reprex package (v2.0.1)

strengejacke avatar Jul 06 '22 12:07 strengejacke

But NSE only works for one variable, we can't combine several unquoted variable names with c(), like:

data_arrange2(head(mtcars), c(-cyl, -disp), descending = carb)

strengejacke avatar Jul 06 '22 12:07 strengejacke

x2 <- data_arrange2(head(mtcars), -cyl, descending = carb)

To me this is really confusing, I would think it's decreasing by cyl

etiennebacher avatar Jul 06 '22 12:07 etiennebacher

Yeah, that's how dplyr::arrange() works:

x1 <- as.data.frame(dplyr::arrange(head(mtcars), -cyl, desc(disp)))
x2 <- datawizard::data_arrange2(head(mtcars), select = c("cyl", "disp"), ascending = "")
identical(x1, x2)
#> [1] TRUE

Created on 2022-07-06 by the reprex package (v2.0.1)

strengejacke avatar Jul 06 '22 13:07 strengejacke

An alternative would be to just have ascending and descending, and only variables specified in one of those arguments will be used for sorting, so we can skip select and exclude.

strengejacke avatar Jul 06 '22 13:07 strengejacke

An alternative would be to just have ascending and descending, and only variables specified in one of those arguments will be used for sorting, so we can skip select and exclude.

In this case it's impossible to specify the order of priority among the sorting variables, e.g., you can't specify drat, -disp, mpg

etiennebacher avatar Jul 06 '22 13:07 etiennebacher

True.

strengejacke avatar Jul 06 '22 13:07 strengejacke

Given the limitations of our NSE, I think a logical descending argument, defaulting to FALSE, would probably be the least onerous?

bwiernik avatar Jul 06 '22 14:07 bwiernik

library(datawizard)

# by default, all in ascending order
x1 <- data_arrange(head(mtcars), "gear", "carb")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb"))
identical(x1, x2)
#> [1] TRUE

x1 <- data_arrange(head(mtcars), "-carb")
x2 <- data_arrange2(head(mtcars), select = "carb", descending = TRUE)
identical(x1, x2)
#> [1] TRUE

x1 <- data_arrange(head(mtcars), "gear", "-carb")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb"), descending = c(FALSE, TRUE))
identical(x1, x2)
#> [1] TRUE

x1 <- data_arrange(head(mtcars), "gear", "-carb", "am")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb", "am"), descending = c(FALSE, TRUE, FALSE))
identical(x1, x2)
#> [1] TRUE

x1 <- data_arrange(head(mtcars), "gear", "-carb", "-am")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb", "am"), descending = c(FALSE, TRUE, TRUE))
identical(x1, x2)
#> [1] TRUE

x1 <- data_arrange(head(mtcars), "-gear", "-carb", "-am")
x2 <- data_arrange2(head(mtcars), select = c("gear", "carb", "am"), descending = TRUE)
identical(x1, x2)
#> [1] TRUE

Created on 2022-07-06 by the reprex package (v2.0.1)

strengejacke avatar Jul 06 '22 17:07 strengejacke

I personally prefer the https://github.com/easystats/datawizard/issues/193#issuecomment-1176180275 approach, it's more explicit, but have no really strong opinion here.

strengejacke avatar Jul 06 '22 18:07 strengejacke

@easystats/maintainers Now it's time to vote:

  1. data_arrange(data, ..., safe = TRUE)
     (https://github.com/easystats/datawizard/issues/193#issuecomment-1176042144)
  2. data_arrange(data, select = NULL, exclude = NULL, ascending = NULL, descending = NULL, ignore_case = FALSE, ...)
     (https://github.com/easystats/datawizard/issues/193#issuecomment-1176180275)
  3. data_arrange(data, select = NULL, exclude = NULL, descending = NULL, ignore_case = FALSE, ...)
     (https://github.com/easystats/datawizard/issues/193#issuecomment-1176515534)

strengejacke avatar Jul 06 '22 19:07 strengejacke

x2 <- data_arrange2(head(mtcars), select = c("gear", "carb", "am"), descending = c(FALSE, TRUE, TRUE))

I don't want to be the guy who criticizes every single idea except mine 😬, but this looks even weirder to me and it makes it kind of impossible to use select helpers with this. Suppose I want to arrange the data in a decreasing order by every column that starts with "foo". I can use select = starts_with("foo") but how do I know how many TRUE should be repeated in descending?
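
To illustrate the problem, a minimal sketch with a made-up data frame. The number of matched columns is only known once the selection has been resolved, so a per-column logical vector can't be written up front; recycling a single `TRUE` (as in the `descending = TRUE` examples above) covers the "all descending" case but not mixed directions. `startsWith()` stands in here for the `starts_with("foo")` select helper.

```r
df <- data.frame(
  foo_a = c(1, 2, 2),
  foo_b = c("x", "y", "z"),
  bar   = 3:1
)

# resolve the selection first; its length is not known in advance
cols <- names(df)[startsWith(names(df), "foo")]

# recycle a single flag to one value per matched column
descending <- rep_len(TRUE, length(cols))

# build one sort key per column, negated where descending is requested
keys <- Map(function(col, desc) {
  key <- xtfrm(df[[col]])
  if (desc) -key else key
}, cols, descending)

sorted <- df[do.call(order, unname(keys)), ]
sorted
#>   foo_a foo_b bar
#> 3     2     z   1
#> 2     2     y   2
#> 1     1     x   3
```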

etiennebacher avatar Jul 06 '22 20:07 etiennebacher

I don't want to be the guy who criticizes every single idea except mine 😬

😄

On the one hand, I agree with you - select helpers are more difficult to use and to match with descending if that's a logical. On the other hand, how many real-world use cases exist where you sort by more than 1 or 2 variables?

My 2 cents:

  1. Pro: short, will be useful in many/most situations; Con: unusual function design (for datawizard), less flexible select
  2. Pro: typical function design, flexible select; Con: can be annoying when you need to repeat definition of variables in both select and ascending
  3. Pro: typical function design, flexible select; Con: descending hardly usable when using select-helpers

strengejacke avatar Jul 06 '22 20:07 strengejacke

On the other hand, how many real-world use cases exist where you sort by more than 1 or 2 variables?

I agree 100% on that, which is why I think it should be super easy and fast (to type) to arrange a dataset by a couple of variables. That's why having all the arguments select, exclude, ascending, descending, etc. looks overcomplicated to me.

My opinion:

  1. Syntax a bit different than other datawizard functions (but easy to understand with examples from the docs), more convenient
  2. Keeps consistency with other datawizard functions, readability OK but having to write several vars twice is inconvenient and error-prone
  3. Convenience and consistency same as 2. but really decreases readability IMO

etiennebacher avatar Jul 06 '22 20:07 etiennebacher