forcats
forcats copied to clipboard
factor lump for ordinal factors
Hi Hadley,
Do you have thoughts on creating an analogue (or generic) of fct_lump() for ordinal factors?
The utility I am looking for is the ability to keep contiguous levels together. A simple solution for this could be to lump ordinal levels directionally (either from the left or right), instead of by frequency.
Here is an example of the current behaviour:
set.seed(6)
f <- ordered(sample(letters[1:5], 10, replace = TRUE))
fct_count(f)
#> # A tibble: 5 × 2
#> f n
#> <ord> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 1
#> 4 d 2
#> 5 e 4
fct_lump(f, 3) # non-contiguous lumping
#> [1] d e b b e e e d Other Other
#> Levels: b < d < e < Other
I imagine that in many cases, lumping categories a and c such that they are placed above b, d, and e will not created a desired result.
Here is what I have in mind for ord_lump():
ord_lump(f, 3, from = "left")
#> [1] Other Other b b Other Other Other Other c a
#> Levels: a < b < c < Other
If you think this is worth it, I can propose a PR. An alternative would be to work the change into fct_lump by adding type = c("freq", "left", "right") into the formals. Let me know what you think.
Cheers,
Rafael
Hmmmm, this seems like it will need quite a different algorithm to fct_lump so it should be a different function. I think you need to spell out the general approach here a bit more before it's worth doing a PR.
I have a related use case where I want to symmetrically collapse ordered factors, grouping the top and bottom n levels. Say, converting responses on a Likert survey item into groups of hi-med-lo. Rafael's proposal might partially meet my needs. It feels like a new function separate from fct_lump.
Sounds like there is some renewed interest in this.
As a general approach, the same formals can be kept as fct_lump, with the exception of "ties.method", which can be dropped since there are no ties in an ordered factor. For positive n or p, the levels would lumped together, starting from the left or right (as specified). For negative n (or p), the n levels would be retained from the left or right, and others lumped. The method should be general enough to apply intuitively on unordered factors too, based on the default ordering given by levels().
The formals would boil down to:
ord_lump(f, n, prop, from = c("left", "right"), other_level = "Other")
with the following behaviour:
(x <- factor(LETTERS[1:8], ordered = TRUE))
#> [1] A B C D E F G H
#> Levels: A < B < C < D < E < F < G < H
ord_lump(x, 3, from = "right")
#> [1] A B C Other Other Other Other Other
#> Levels: A < B < C < Other
ord_lump(x, -3, from = "right")
#> [1] A B C D E Other Other Other
#> Levels: A < B < C < D < E < Other
ord_lump(x, 3, from = "left")
#> [1] Other Other Other D E F G H
#> Levels: Other < D < E < F < G < H
ord_lump(x, prop = 0.25, from = "left")
#> [1] Other Other C D E F G H
#> Levels: Other < C < D < E < F < G < H
# unordered factor
y <- factor(LETTERS[1:8])
ord_lump(y, 3, from = "left")
#> [1] Other Other Other D E F G H
#> Levels: Other D E F G H
@sfirke's use case might be tricky to implement intuitively, since the lumping would need to occur from both sides, and a tie-breaking rule chosen when the middle interval is even. Maybe best left as a two-step process relying on a lumping function?
I could see that ord_lump function above working with my use case. If the user can pass both arguments to "from", then can they supply two arguments to "into" (in the style of tidyr::separate, I prefer to "other_level" here)?
Then using data from the above example, here's how I imagine it:
ord_lump(x, -3, from = c("left", "right"), into = c("top", "bottom"))
#> [1] top top top D E bottom bottom bottom
#> Levels: top < D < E < bottom
If that's too complex, I think I could hack what I want together from multiple chained ord_lump as calls proposed above.
I was interested in this behaviour for tabular data, where I want to be able to collapse levels with small counts into the level closest to them (as a privacy/disclosure control) protection and found this discussion. In case someone else comes here with a similar problem, please see below for some crude code. It takes the nearest level and sums the count across the levels and relabels the new summarised level. It depends on dplyr and purrr.
library(tidyverse)
## Create example data
mydf <- tibble(x = sample(letters[1:3], 50, replace = TRUE), n = rpois(50, 5))
mydf <- mydf %>%
group_by(x) %>%
mutate(i = seq_along(x)) %>%
ungroup() %>%
arrange(x, i)
LumpByN <- function(i, n, n_threshold =5, fail_iters = 10){
## for an ordered variable, i it lumps together the closest ordered levels when they have
## a count of n_threshold or lower
mydf <- tibble(n = n, i = i) %>%
arrange(i)
## Loop through until no values less than 5 found
count <- 0
while(any(mydf$n <=5) & count < fail_iters){
# print(count)
lag_lead <- if( (count %%2) ==1) function(x) lag(x, default = FALSE) else function(x) lead(x, default = FALSE)
mydf <- mydf %>%
mutate(n_small = n<=n_threshold,
i_match = if_else(lag_lead(n_small), lag_lead(i), i)) %>%
group_by(i_match) %>%
summarise(n = sum(n),
i = paste(i, collapse = " or ")) %>%
ungroup() %>%
select(-i_match) %>%
mutate(i_order = as.integer(str_extract(i, "[0-9]{1,}\\b"))) %>%
arrange(i_order)
count <- count +1
}
mydf
}
## Apply to a group
mydf_slct <- mydf %>%
filter(x == "a")
LumpByN(mydf_slct$i, mydf_slct$n)
## Apply to multiple groups
mydf2 <- mydf %>%
group_by(x) %>%
nest()
map(mydf2$data, ~ LumpByN(.x$i, .x$n))