janitor
janitor copied to clipboard
adorn_cumulative function for one-way tabyls
Suggested by @elinw in #231.
I am thinking this function takes a data.frame and a column name and appends a column that is the cumulative sum of that target column. It also takes an argument dir
specifying whether the function should sum down from the top (default) or up from the bottom.
The default case would be to run on the result of a one-way tabyl
so in my version below, when the column name is not specified it first looks for valid_percent
and then percent
. If both are present, should it generate two cumsum columns, or should it default to valid_percent
?
#' @title Add a cumulative sum column to a data.frame.
#'
#' @description A tidyverse-style function to add a cumsum column.
#'
#' @param dat data.frame to add cumulative sum to
#' @param colname specify the unquoted name of the column to sum, or leave it blank in which case the function will default first to a column called "valid_percent" and then for "percent". These defaults support running this function on the result of calls to \code{janitior::tabyl}. If no colname is supplied and these default columns are absent in the data.frame, the function will error.
#' @param dir direction to sum; defaults to "down" but can sum from the bottom of a data.frame with "up". In this case the resulting column name will be "cumulative_up".
#'
#' @return a data.frame.
#' @export
#'
#' @examples
#' library(janitor)
#' mtcars %>%
#' adorn_cumulative(mpg)
#'
#' mtcars %>%
#' tabyl(cyl) %>%
#' adorn_cumulative()
#'
#' # Vector with an NA
#' x <- c(0, 1, 2, 3, 3, 3, NA)
#'
#' x %>%
#' tabyl() %>%
#' adorn_cumulative()
#'
#' x %>%
#' tabyl() %>%
#' adorn_cumulative(dir = "up")
adorn_cumulative <- function(dat, colname, dir = "down"){
if(!missing(colname)){
colname <- rlang::enquo(colname)
} else if("valid_percent" %in% names(dat)) {
colname <- rlang::sym("valid_percent")
} else if("percent" %in% names(dat)){
colname <- rlang::sym("percent")
} else {
stop("\"colname\" not specified and default columns valid_percent and percent are not present in data.frame dat")
}
target <- dplyr::pull(dat, !! colname)
if(dir == "up"){
target <- rev(target)
}
dat$cumulative <- cumsum(ifelse(is.na(target), 0, target)) + target*0 # an na.rm version of cumsum, from https://stackoverflow.com/a/25576972
if(dir == "up"){
dat$cumulative <- rev(dat$cumulative)
names(dat)[names(dat) %in% "cumulative"] <- "cumulative_up"
}
dat
}
^^^ this works so try out the examples and give feedback in this issue!
Hello. Thank you for the package. The cumulative function works, but when used with Totals, it adds up the 100% and ends up to 200%. I have attached the code and result:

I guess adorn_cumulative
should check for a totals row, and if so then put something else in the last value. NA
is probably easiest. Then adorn_pct_formatting
would need to turn NA
into something like "-"
, not NA%
. Edit: this is more of a problem when dir = "up"
, then the 100 from the totals would immediate throw things off.
I think this will be a useful addition to janitor, but it might have more complexity and require more care than I can devote for the 1.2 release. I'll try, though, and am marking this 1.2.
This is still worthy but I can't get this done right in the next 24 hours, it will have to wait for some future release. Anyone's welcome to turn the above code into a PR with tests & addressing the issues with totals and formatting identified above.
Hi @sfirke,
I have been trying to add to the code you provided above, and wanted to share what I have in case it's useful. I think that this will work in any case where adorn_totals()
is called and then adorn_cumulative()
, but am struggling to find a good way to implement if someone were to round values before adorn_cumulative().
The only thing I can think of is to warn the user if the max column value falls within a threshold of the sum of the cumulative sum.
Example:
library(dplyr)
library(janitor)
set.seed(1)
data.frame(test = "test",
a = runif(100)) %>%
adorn_totals() %>%
mutate(a = round(a,0)) %>%
adorn_cumulative(a) %>%
tail()
It appears you have a 'totals' column. Watch for rounding discrepancies.
test a cumulative
test 0 NA
test 1 48
test 0 NA
test 1 49
test 1 50
Total 50 NA
This is obviously more of an issue with large datasets with rounded values.
Anyway, hopefully this is useful and helps to implement adorn_cumulative()
. I work with many analysts who are moving from SPSS to R, and this would really help to emulate a SPSS-style FREQUENCIES
table.
Edit: Possibly add an ignore_row
arg?
set.seed(1)
data.frame(test = "test",
a = runif(100)) %>%
adorn_totals() %>%
mutate(a = round(a,0)) %>%
arrange(desc(a)) %>%
adorn_cumulative(a, ignore_row = "first") %>%
head()
test a cumulative
Total 56 NA
test 1 1
test 1 2
test 1 3
test 1 4
test 1 5
adorn_cumulative <- function(dat, colname, dir = "down", ignore_row = NULL){
if(!missing(colname)){
colname <- rlang::enquo(colname)
} else if("valid_percent" %in% names(dat)) {
colname <- rlang::sym("valid_percent")
} else if("percent" %in% names(dat)){
colname <- rlang::sym("percent")
} else {
stop("\"colname\" not specified and default columns valid_percent and percent are not present in data.frame dat")
}
if (!dir %in% c("up", "down")) {
stop("'dir' must be one of 'up', 'down'")
}
check <- rlang::quo_name(colname)
if(!inherits(dat[[check]], c("numeric"))) {
stop("column must be of class numeric.")
}
if (is.null(ignore_row)){
target <- dplyr::pull(dat, !! colname)
} else if(ignore_row == "last"){
target <- dplyr::pull(dat, !! colname)
target[length(target)] <- NA
}else if (ignore_row == "first"){
target <- dplyr::pull(dat, !! colname)
target[1] <- NA
}
sum_is_1 <- sum(target,na.rm = T) - max(target,na.rm = T) == 1
cumsum_1 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 1
cumsum_2 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 2
cumsum_n <- sum(target, na.rm = T) / max(target, na.rm = T) == 2
if ((dir == "up" & sum_is_1 == T) |
(dir == "up" & cumsum_1 == T) |
(dir== "up" & cumsum_2 == T) |
(dir == "up" & cumsum_n == T)) {
target <- replace(target, target == max(target,na.rm=T), NA)
message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
} else if (sum_is_1 == T | cumsum_1 == T | cumsum_2 == T | cumsum_n == T) {
target <- replace(target, target == max(target,na.rm=T), 0)
message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
} else if ((sum(target,na.rm = T) - max(target,na.rm = T) > 1.9 & sum(target,na.rm = T) - max(target,na.rm = T) < 2.1 ) |
(max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
(max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
(sum(target, na.rm = T) / max(target, na.rm = T) > 1.9 & sum(target, na.rm = T) / max(target, na.rm = T) < 2.1)){
warning("It appears you have a rounded 'totals' column. Watch for rounding discrepancies.")
}
if(dir == "up"){
target <- rev(target)
}
dat$cumulative <- cumsum(ifelse(is.na(target), 0, target)) + target*0 # an na.rm version of cumsum, from https://stackoverflow.com/a/25576972
dat$cumulative[duplicated(dat$cumulative)] <- NA
if(dir == "up"){
dat$cumulative <- rev(dat$cumulative)
names(dat)[names(dat) %in% "cumulative"] <- "cumulative_up"
# Creates NA for repeated values
# Assumption that only repeated cumulative value would be a total
dat$cumulative_up[duplicated(dat$cumulative_up)] <- NA
}
dat
}
Hi @sfirke,
I have been trying to add to the code you provided above, and wanted to share what I have in case it's useful. I think that this will work in any case where
adorn_totals()
is called and thenadorn_cumulative()
, but am struggling to find a good way to implement if someone were to round values beforeadorn_cumulative().
The only thing I can think of is to warn the user if the max column value falls within a threshold of the sum of the cumulative sum.
Example:
library(dplyr) library(janitor) set.seed(1) data.frame(test = "test", a = runif(100)) %>% adorn_totals() %>% mutate(a = round(a,0)) %>% adorn_cumulative(a) %>% tail() It appears you have a 'totals' column. Watch for rounding discrepancies. test a cumulative test 0 NA test 1 48 test 0 NA test 1 49 test 1 50 Total 50 NA
This is obviously more of an issue with large datasets with rounded values.
Anyway, hopefully this is useful and helps to implement
adorn_cumulative()
. I work with many analysts who are moving from SPSS to R, and this would really help to emulate a SPSS-styleFREQUENCIES
table.Edit: Possibly add an
ignore_row
arg?set.seed(1) data.frame(test = "test", a = runif(100)) %>% adorn_totals() %>% mutate(a = round(a,0)) %>% arrange(desc(a)) %>% adorn_cumulative(a, ignore_row = "first") %>% head() test a cumulative Total 56 NA test 1 1 test 1 2 test 1 3 test 1 4 test 1 5
adorn_cumulative <- function(dat, colname, dir = "down", ignore_row = NULL){ if(!missing(colname)){ colname <- rlang::enquo(colname) } else if("valid_percent" %in% names(dat)) { colname <- rlang::sym("valid_percent") } else if("percent" %in% names(dat)){ colname <- rlang::sym("percent") } else { stop("\"colname\" not specified and default columns valid_percent and percent are not present in data.frame dat") } if (!dir %in% c("up", "down")) { stop("'dir' must be one of 'up', 'down'") } check <- rlang::quo_name(colname) if(!inherits(dat[[check]], c("numeric"))) { stop("column must be of class numeric.") } if (is.null(ignore_row)){ target <- dplyr::pull(dat, !! colname) } else if(ignore_row == "last"){ target <- dplyr::pull(dat, !! colname) target[length(target)] <- NA }else if (ignore_row == "first"){ target <- dplyr::pull(dat, !! colname) target[1] <- NA } sum_is_1 <- sum(target,na.rm = T) - max(target,na.rm = T) == 1 cumsum_1 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 1 cumsum_2 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 2 cumsum_n <- sum(target, na.rm = T) / max(target, na.rm = T) == 2 if ((dir == "up" & sum_is_1 == T) | (dir == "up" & cumsum_1 == T) | (dir== "up" & cumsum_2 == T) | (dir == "up" & cumsum_n == T)) { target <- replace(target, target == max(target,na.rm=T), NA) message("It appears you have a 'totals' column. Watch for rounding discrepancies.") } else if (sum_is_1 == T | cumsum_1 == T | cumsum_2 == T | cumsum_n == T) { target <- replace(target, target == max(target,na.rm=T), 0) message("It appears you have a 'totals' column. Watch for rounding discrepancies.") } else if ((sum(target,na.rm = T) - max(target,na.rm = T) > 1.9 & sum(target,na.rm = T) - max(target,na.rm = T) < 2.1 ) | (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) | (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) | (sum(target, na.rm = T) / max(target, na.rm = T) > 1.9 & sum(target, na.rm = T) / max(target, na.rm = T) < 2.1)){ warning("It appears you have a rounded 'totals' column. Watch for rounding discrepancies.") } if(dir == "up"){ target <- rev(target) } dat$cumulative <- cumsum(ifelse(is.na(target), 0, target)) + target*0 # an na.rm version of cumsum, from https://stackoverflow.com/a/25576972 dat$cumulative[duplicated(dat$cumulative)] <- NA if(dir == "up"){ dat$cumulative <- rev(dat$cumulative) names(dat)[names(dat) %in% "cumulative"] <- "cumulative_up" # Creates NA for repeated values # Assumption that only repeated cumulative value would be a total dat$cumulative_up[duplicated(dat$cumulative_up)] <- NA } dat }
Any suggestions on how one could amend the above code for cumulative percentage?