janitor icon indicating copy to clipboard operation
janitor copied to clipboard

adorn_cumulative function for one-way tabyls

Open sfirke opened this issue 6 years ago • 6 comments

Suggested by @elinw in #231.

I am thinking this function takes a data.frame and a column name and appends a column that is the cumulative sum of that target column. It also takes an argument dir specifying whether the function should sum down from the top (default) or up from the bottom.

The default case would be to run on the result of a one-way tabyl so in my version below, when the column name is not specified it first looks for valid_percent and then percent. If both are present, should it generate two cumsum columns, or should it default to valid_percent?


#' @title Add a cumulative sum column to a data.frame.
#'
#' @description A tidyverse-style function to add a cumsum column.
#'
#' @param dat data.frame to add cumulative sum to
#' @param colname specify the unquoted name of the column to sum, or leave it blank in which case the function will default first to a column called "valid_percent" and then for "percent".  These defaults support running this function on the result of calls to \code{janitior::tabyl}.  If no colname is supplied and these default columns are absent in the data.frame, the function will error.
#' @param dir direction to sum; defaults to "down" but can sum from the bottom of a data.frame with "up".  In this case the resulting column name will be "cumulative_up".
#'
#' @return a data.frame.
#' @export
#'
#' @examples
#' library(janitor)
#' mtcars %>%
#'   adorn_cumulative(mpg)
#'
#' mtcars %>%
#'   tabyl(cyl) %>%
#'   adorn_cumulative()
#'
#' # Vector with an NA
#' x <- c(0, 1, 2, 3, 3, 3, NA)
#'
#' x %>%
#'   tabyl() %>%
#'   adorn_cumulative()
#'
#' x %>%
#'   tabyl() %>%
#'   adorn_cumulative(dir = "up")

adorn_cumulative <- function(dat, colname, dir = "down"){

  if(!missing(colname)){
    colname <- rlang::enquo(colname)
  } else if("valid_percent" %in% names(dat)) {
  colname <- rlang::sym("valid_percent")
  } else if("percent" %in% names(dat)){
    colname <- rlang::sym("percent")
  } else {
    stop("\"colname\" not specified and default columns valid_percent and percent are not present in data.frame dat")
  }

  target <- dplyr::pull(dat, !! colname)

  if(dir == "up"){
    target <- rev(target)
  }
  dat$cumulative <- cumsum(ifelse(is.na(target), 0, target)) + target*0 # an na.rm version of cumsum, from https://stackoverflow.com/a/25576972
  if(dir == "up"){
    dat$cumulative <- rev(dat$cumulative)
    names(dat)[names(dat) %in% "cumulative"] <- "cumulative_up"
  }
  dat
}

^^^ this works so try out the examples and give feedback in this issue!

sfirke avatar Sep 19 '18 14:09 sfirke

Hello. Thank you for the package. The cumulative function works, but when used with Totals, it adds up the 100% and ends up to 200%. I have attached the code and result:

ekran resmi 2019-02-13 21 26 08

sbalci avatar Feb 13 '19 18:02 sbalci

I guess adorn_cumulative should check for a totals row, and if so then put something else in the last value. NA is probably easiest. Then adorn_pct_formatting would need to turn NA into something like "-", not NA%. Edit: this is more of a problem when dir = "up", then the 100 from the totals would immediate throw things off.

sfirke avatar Feb 14 '19 03:02 sfirke

I think this will be a useful addition to janitor, but it might have more complexity and require more care than I can devote for the 1.2 release. I'll try, though, and am marking this 1.2.

sfirke avatar Mar 15 '19 03:03 sfirke

This is still worthy but I can't get this done right in the next 24 hours, it will have to wait for some future release. Anyone's welcome to turn the above code into a PR with tests & addressing the issues with totals and formatting identified above.

sfirke avatar Apr 20 '19 17:04 sfirke

Hi @sfirke,

I have been trying to add to the code you provided above, and wanted to share what I have in case it's useful. I think that this will work in any case where adorn_totals() is called and then adorn_cumulative(), but am struggling to find a good way to implement if someone were to round values before adorn_cumulative().

The only thing I can think of is to warn the user if the max column value falls within a threshold of the sum of the cumulative sum.

Example:

library(dplyr)
library(janitor)
set.seed(1)

data.frame(test = "test",
           a = runif(100)) %>% 
  adorn_totals() %>% 
  mutate(a = round(a,0)) %>% 
  adorn_cumulative(a) %>% 
  tail()

It appears you have a 'totals' column. Watch for rounding discrepancies.
  test  a cumulative
  test  0         NA
  test  1         48
  test  0         NA
  test  1         49
  test  1         50
 Total 50         NA

This is obviously more of an issue with large datasets with rounded values.

Anyway, hopefully this is useful and helps to implement adorn_cumulative(). I work with many analysts who are moving from SPSS to R, and this would really help to emulate a SPSS-style FREQUENCIES table.

Edit: Possibly add an ignore_row arg?

set.seed(1)

data.frame(test = "test",
           a = runif(100)) %>% 
  adorn_totals() %>% 
  mutate(a = round(a,0)) %>% 
  arrange(desc(a)) %>% 
  adorn_cumulative(a, ignore_row = "first") %>% 
  head()

 test  a cumulative
 Total 56         NA
  test  1          1
  test  1          2
  test  1          3
  test  1          4
  test  1          5
adorn_cumulative <- function(dat, colname, dir = "down", ignore_row = NULL){
  
  if(!missing(colname)){
    colname <- rlang::enquo(colname)
  } else if("valid_percent" %in% names(dat)) {
    colname <- rlang::sym("valid_percent")
  } else if("percent" %in% names(dat)){
    colname <- rlang::sym("percent")
  } else {
    stop("\"colname\" not specified and default columns valid_percent and percent are not present in data.frame dat")
  }
  
  if (!dir %in% c("up", "down")) {
    stop("'dir' must be one of 'up', 'down'")
  }
  
  check <- rlang::quo_name(colname)
  
  if(!inherits(dat[[check]], c("numeric"))) {
    stop("column must be of class numeric.")
  }
  
  if (is.null(ignore_row)){
    target <- dplyr::pull(dat, !! colname)
  } else  if(ignore_row == "last"){
    target <- dplyr::pull(dat, !! colname)
    target[length(target)] <- NA
  }else if (ignore_row == "first"){
    target <- dplyr::pull(dat, !! colname)
    target[1] <- NA
  }
  

  sum_is_1 <- sum(target,na.rm = T) - max(target,na.rm = T) == 1  
  
  cumsum_1 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 1 
  
  cumsum_2 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 2 
  
  cumsum_n <- sum(target, na.rm = T) / max(target, na.rm = T) == 2
  
  if ((dir == "up" & sum_is_1 == T) | 
      (dir == "up" & cumsum_1 == T) | 
      (dir== "up" & cumsum_2 == T)  |
      (dir == "up" & cumsum_n == T)) {
    target <- replace(target, target == max(target,na.rm=T), NA)
    message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
    
  } else if (sum_is_1 == T | cumsum_1 == T | cumsum_2 == T | cumsum_n == T) {
    target <- replace(target, target == max(target,na.rm=T), 0)
    message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
    
  } else if ((sum(target,na.rm = T) - max(target,na.rm = T) > 1.9 & sum(target,na.rm = T) - max(target,na.rm = T) < 2.1 ) |
             (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
             (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
             (sum(target, na.rm = T) / max(target, na.rm = T) > 1.9 & sum(target, na.rm = T) / max(target, na.rm = T) < 2.1)){
    warning("It appears you have a rounded 'totals' column. Watch for rounding discrepancies.")
  }
  
  
  
  if(dir == "up"){
    target <- rev(target)
  }
  
  dat$cumulative <- cumsum(ifelse(is.na(target), 0, target)) + target*0 # an na.rm version of cumsum, from https://stackoverflow.com/a/25576972
  dat$cumulative[duplicated(dat$cumulative)] <- NA
  
  if(dir == "up"){
    dat$cumulative <- rev(dat$cumulative)
    names(dat)[names(dat) %in% "cumulative"] <- "cumulative_up"
    
    # Creates NA for repeated values
    # Assumption that only repeated cumulative value would be a total
    dat$cumulative_up[duplicated(dat$cumulative_up)] <- NA  
  }
  dat
}

mattroumaya avatar Feb 05 '21 16:02 mattroumaya

Hi @sfirke,

I have been trying to add to the code you provided above, and wanted to share what I have in case it's useful. I think that this will work in any case where adorn_totals() is called and then adorn_cumulative(), but am struggling to find a good way to implement if someone were to round values before adorn_cumulative().

The only thing I can think of is to warn the user if the max column value falls within a threshold of the sum of the cumulative sum.

Example:

library(dplyr)
library(janitor)
set.seed(1)

data.frame(test = "test",
           a = runif(100)) %>% 
  adorn_totals() %>% 
  mutate(a = round(a,0)) %>% 
  adorn_cumulative(a) %>% 
  tail()

It appears you have a 'totals' column. Watch for rounding discrepancies.
  test  a cumulative
  test  0         NA
  test  1         48
  test  0         NA
  test  1         49
  test  1         50
 Total 50         NA

This is obviously more of an issue with large datasets with rounded values.

Anyway, hopefully this is useful and helps to implement adorn_cumulative(). I work with many analysts who are moving from SPSS to R, and this would really help to emulate a SPSS-style FREQUENCIES table.

Edit: Possibly add an ignore_row arg?

set.seed(1)

data.frame(test = "test",
           a = runif(100)) %>% 
  adorn_totals() %>% 
  mutate(a = round(a,0)) %>% 
  arrange(desc(a)) %>% 
  adorn_cumulative(a, ignore_row = "first") %>% 
  head()

 test  a cumulative
 Total 56         NA
  test  1          1
  test  1          2
  test  1          3
  test  1          4
  test  1          5
adorn_cumulative <- function(dat, colname, dir = "down", ignore_row = NULL){
 
 if(!missing(colname)){
   colname <- rlang::enquo(colname)
 } else if("valid_percent" %in% names(dat)) {
   colname <- rlang::sym("valid_percent")
 } else if("percent" %in% names(dat)){
   colname <- rlang::sym("percent")
 } else {
   stop("\"colname\" not specified and default columns valid_percent and percent are not present in data.frame dat")
 }
 
 if (!dir %in% c("up", "down")) {
   stop("'dir' must be one of 'up', 'down'")
 }
 
 check <- rlang::quo_name(colname)
 
 if(!inherits(dat[[check]], c("numeric"))) {
   stop("column must be of class numeric.")
 }
 
 if (is.null(ignore_row)){
   target <- dplyr::pull(dat, !! colname)
 } else  if(ignore_row == "last"){
   target <- dplyr::pull(dat, !! colname)
   target[length(target)] <- NA
 }else if (ignore_row == "first"){
   target <- dplyr::pull(dat, !! colname)
   target[1] <- NA
 }
 

 sum_is_1 <- sum(target,na.rm = T) - max(target,na.rm = T) == 1  
 
 cumsum_1 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 1 
 
 cumsum_2 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 2 
 
 cumsum_n <- sum(target, na.rm = T) / max(target, na.rm = T) == 2
 
 if ((dir == "up" & sum_is_1 == T) | 
     (dir == "up" & cumsum_1 == T) | 
     (dir== "up" & cumsum_2 == T)  |
     (dir == "up" & cumsum_n == T)) {
   target <- replace(target, target == max(target,na.rm=T), NA)
   message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
   
 } else if (sum_is_1 == T | cumsum_1 == T | cumsum_2 == T | cumsum_n == T) {
   target <- replace(target, target == max(target,na.rm=T), 0)
   message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
   
 } else if ((sum(target,na.rm = T) - max(target,na.rm = T) > 1.9 & sum(target,na.rm = T) - max(target,na.rm = T) < 2.1 ) |
            (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
            (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
            (sum(target, na.rm = T) / max(target, na.rm = T) > 1.9 & sum(target, na.rm = T) / max(target, na.rm = T) < 2.1)){
   warning("It appears you have a rounded 'totals' column. Watch for rounding discrepancies.")
 }
 
 
 
 if(dir == "up"){
   target <- rev(target)
 }
 
 dat$cumulative <- cumsum(ifelse(is.na(target), 0, target)) + target*0 # an na.rm version of cumsum, from https://stackoverflow.com/a/25576972
 dat$cumulative[duplicated(dat$cumulative)] <- NA
 
 if(dir == "up"){
   dat$cumulative <- rev(dat$cumulative)
   names(dat)[names(dat) %in% "cumulative"] <- "cumulative_up"
   
   # Creates NA for repeated values
   # Assumption that only repeated cumulative value would be a total
   dat$cumulative_up[duplicated(dat$cumulative_up)] <- NA  
 }
 dat
}

Any suggestions on how one could amend the above code for cumulative percentage?

truenomad avatar Dec 24 '21 22:12 truenomad