janitor
adorn_ns captures character columns from tibble
adorn_ns is modifying character columns from a tibble. Note that this does not occur from adorn_percentages or adorn_pct_formatting.
require(janitor)
#> Loading required package: janitor
require(tidyverse)
#> Loading required package: tidyverse
dat <- structure(list(race = c("American Indian And Alaska Native Alone",
"American Indian And Alaska Native Alone", "Asian Alone", "Asian Alone",
"Black Or African American Alone", "Black Or African American Alone",
"Hispanic Or Latino", "Hispanic Or Latino", "Native Hawaiian And Other Pacific Islander Alone",
"Native Hawaiian And Other Pacific Islander Alone", "Some Other Race Alone",
"Some Other Race Alone", "Two Or More Races", "Two Or More Races",
"White Alone", "White Alone", "White Alone, Not Hispanic Or Latino",
"White Alone, Not Hispanic Or Latino"), age = c("All Ages", "Children age 0-5",
"All Ages", "Children age 0-5", "All Ages", "Children age 0-5",
"All Ages", "Children age 0-5", "All Ages", "Children age 0-5",
"All Ages", "Children age 0-5", "All Ages", "Children age 0-5",
"All Ages", "Children age 0-5", "All Ages", "Children age 0-5"
), `Above FPL` = c(1821, 80, 38093, 1716, 123148, 6695, 80869,
5794, 84, 0, 28696, 1369, 23310, 2583, 269859, 12799, 236366,
10297), `Below FPL` = c(676, 42, 17121, 521, 40132, 3904, 39454,
3692, 13, 0, 15252, 1181, 7487, 937, 49165, 2398, 34717, 1067
)), .Names = c("race", "age", "Above FPL", "Below FPL"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -18L))
dat %>% adorn_percentages(denominator = "row") %>% adorn_pct_formatting() %>% adorn_ns()
#> race
#> American Indian And Alaska Native Alone
#> American Indian And Alaska Native Alone
#> Asian Alone
#> Asian Alone
#> Black Or African American Alone
#> Black Or African American Alone
#> Hispanic Or Latino
#> Hispanic Or Latino
#> Native Hawaiian And Other Pacific Islander Alone
#> Native Hawaiian And Other Pacific Islander Alone
#> Some Other Race Alone
#> Some Other Race Alone
#> Two Or More Races
#> Two Or More Races
#> White Alone
#> White Alone
#> White Alone, Not Hispanic Or Latino
#> White Alone, Not Hispanic Or Latino
#> age Above FPL Below FPL
#> All Ages (All Ages) 72.9% (1821) 27.1% (676)
#> Children age 0-5 (Children age 0-5) 65.6% (80) 34.4% (42)
#> All Ages (All Ages) 69.0% (38093) 31.0% (17121)
#> Children age 0-5 (Children age 0-5) 76.7% (1716) 23.3% (521)
#> All Ages (All Ages) 75.4% (123148) 24.6% (40132)
#> Children age 0-5 (Children age 0-5) 63.2% (6695) 36.8% (3904)
#> All Ages (All Ages) 67.2% (80869) 32.8% (39454)
#> Children age 0-5 (Children age 0-5) 61.1% (5794) 38.9% (3692)
#> All Ages (All Ages) 86.6% (84) 13.4% (13)
#> Children age 0-5 (Children age 0-5) - (0) - (0)
#> All Ages (All Ages) 65.3% (28696) 34.7% (15252)
#> Children age 0-5 (Children age 0-5) 53.7% (1369) 46.3% (1181)
#> All Ages (All Ages) 75.7% (23310) 24.3% (7487)
#> Children age 0-5 (Children age 0-5) 73.4% (2583) 26.6% (937)
#> All Ages (All Ages) 84.6% (269859) 15.4% (49165)
#> Children age 0-5 (Children age 0-5) 84.2% (12799) 15.8% (2398)
#> All Ages (All Ages) 87.2% (236366) 12.8% (34717)
#> Children age 0-5 (Children age 0-5) 90.6% (10297) 9.4% (1067)
Created on 2018-04-16 by the reprex package (v0.2.0).
This is a great bug report, thank you! :+1:
I don't think this use case ever occurred to me, I only thought of adorn_ns() being called on data.frames that are the result of a call to tabyl(), which won't have multiple character columns like your dat. But your usage of these functions looks good and I agree that only numeric columns from the original tabyl should get pasted as Ns.
I should be able to implement this, but it may take me a little while.
Thanks!
One thought I had for an intermediate fix, which I noticed someone else mentioned in another comment, would be to allow specification of the columns using ... in the adorn_* functions.
I see how that could cause a lot of added complexity when doing _totals though...
I'm a huge fan of janitor btw!
I was just working on this and thought I had the quick fix above. But the fact that adorn_ns takes a custom data.frame of ns complicates things. Say your custom Ns are character values, like "3.4M" - we do want them appended, but restricting to only numeric columns makes that fail.
I'm unsure how to accommodate those custom Ns and also accomplish the behavior in your example above. My initial idea involves treating automatic Ns (like in your example) differently than custom provided Ns. Which I don't love. It would be:
- Automatic/"core" Ns: only append them to columns where those Ns are numeric, solving the problem here
- Custom Ns: specify which columns to append to.
In the latter case, I'm not sure whether controlling which columns of custom Ns to append should be done by the user supplying columns of Ns where all values are NA, as an indicator that those should be skipped, or by adding a new argument cols_to_modify. I lean toward the columns-of-NAs.
Would love to hear what people think, the design aspect of this is harder than the back end coding.
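To make the two options concrete, here is a minimal base R sketch of the split described above. It is not janitor's implementation: append_ns_sketch is a hypothetical helper, the cols_to_modify argument is only the name floated in this comment, and the toy data frames stand in for a formatted table and its Ns.

```r
# Hypothetical helper, NOT part of janitor: paste Ns onto a formatted table.
append_ns_sketch <- function(formatted, ns, cols_to_modify = NULL) {
  stopifnot(identical(dim(formatted), dim(ns)))
  if (is.null(cols_to_modify)) {
    # Automatic/"core" Ns: only adorn columns whose Ns are numeric.
    cols_to_modify <- names(ns)[vapply(ns, is.numeric, logical(1))]
  }
  for (col in cols_to_modify) {
    formatted[[col]] <- paste0(formatted[[col]], " (", ns[[col]], ")")
  }
  formatted
}

# Toy stand-ins (assumed data, shaped like the first two rows of the report).
core <- data.frame(
  age = c("All Ages", "Children age 0-5"),
  above = c(1821, 80),
  below = c(676, 42),
  stringsAsFactors = FALSE
)
formatted <- data.frame(
  age = core$age,
  above = c("72.9%", "65.6%"),
  below = c("27.1%", "34.4%"),
  stringsAsFactors = FALSE
)

# Automatic Ns: the character column `age` is left alone.
auto <- append_ns_sketch(formatted, core)
print(auto)

# Custom character Ns: the caller has to name the columns to adorn.
custom_ns <- data.frame(
  age = core$age,
  above = c("1.8K", "80"),
  below = c("676", "42"),
  stringsAsFactors = FALSE
)
custom <- append_ns_sketch(formatted, custom_ns,
                           cols_to_modify = c("above", "below"))
print(custom)
```

The sketch uses an explicit column argument rather than columns of NAs only because it is shorter to show; it is not an argument for one design over the other.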
Hello, I was curious if there were any updates to this bug? I just experienced it myself after creating a three way table using count( ). My first 2 columns are character columns and I am also getting the result where adorn_ns("front") is duplicating the string in my second column within a set of parentheses.
For your request for feedback, personally I've only ever used it for situations where the Automatic/"core" option would be sufficient. Would you be able to implement an option that does both, say a cols_to_modify = argument whose default is all numeric columns but which can also be set to columns selected by the user?
I came across this issue today and was looking for a solution. I think this is probably not reproducible for the package purposes but for those who find this issue one solution is:
library(dplyr)  # provides %>% and mutate(); strsplit() and sapply() are base R
df %>% mutate(character_col = sapply(strsplit(character_col, "\\s\\("),
                                     `[`, 1))
In this mutate I am splitting the adorned character_col by " (" and then extracting the first chunk from each element of the resulting list.
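The same clean-up can be tried on a toy data frame using base R only. This is a sketch under assumptions: the data frame and column names below are made up for illustration, and sub() is swapped in for the strsplit() approach. It shares the same caveat, namely that an original value which itself contains " (" would be truncated.

```r
# Toy stand-in for a table whose character column was adorned by mistake.
adorned <- data.frame(
  age = c("All Ages (All Ages)", "Children age 0-5 (Children age 0-5)"),
  above_fpl = c("72.9% (1821)", "65.6% (80)"),
  stringsAsFactors = FALSE
)

# Drop everything from the first whitespace-plus-"(" onward,
# in the affected column only.
adorned$age <- sub("\\s\\(.*$", "", adorned$age)

print(adorned)
```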
I need to go look at the code but it occurred to me today that it shouldn't be too hard to, at the end of this function, replace any columns with alphabetical characters, or columns in the core that are character, with their original values (that is, if a custom value of core is not specified, as that could have character values per my comment https://github.com/sfirke/janitor/issues/195#issuecomment-403140569). Quite possible I'm not thinking of something, but that would be fairly simple from an implementation POV.
(the long term fix is to add tidyverse-style column selection to the adorn_ functions, as discussed in #219)
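A rough sketch of that "restore at the end" idea, in base R. The helper restore_character_cols is hypothetical and not janitor code, the toy values are made up, and it assumes the adorned table and the core have the same shape and row order.

```r
# Hypothetical helper, NOT part of janitor: put back the original values of
# any column that is character in the core.
restore_character_cols <- function(adorned, core) {
  stopifnot(identical(dim(adorned), dim(core)))
  char_cols <- names(core)[vapply(core, is.character, logical(1))]
  adorned[char_cols] <- core[char_cols]
  adorned
}

# Toy stand-ins (assumed data): `homeworld` was appended to itself.
core <- data.frame(
  gender = c("feminine", "masculine"),
  homeworld = c("Alderaan", "Tatooine"),
  n = c(1L, 2L),
  stringsAsFactors = FALSE
)
adorned <- data.frame(
  gender = core$gender,
  homeworld = c("Alderaan (Alderaan)", "Tatooine (Tatooine)"),
  n = c("33.3% (1)", "66.7% (2)"),
  stringsAsFactors = FALSE
)

fixed <- restore_character_cols(adorned, core)
print(fixed)
```

As noted in the comment, this only makes sense when the Ns come from the core; custom Ns may legitimately be character values.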
@c-custer thanks for mentioning your use case with count(), that led me to this minimal example to solve:
library(dplyr)
library(janitor)
starwars %>%
count(gender, homeworld) %>%
slice(1:5) %>%
adorn_percentages("col") %>%
adorn_pct_formatting() %>%
adorn_ns()
There the homeworld column gets inappropriately appended to itself.
Just chiming in--looks like the issue persists if you have more than one character column (see 'var1' in the reprex below). However, if I remove the first 'term' variable, 'var1' is appropriately ignored.
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
suppressPackageStartupMessages(library(dplyr))
df <- structure(
list(
term = c(
"F20",
"F20",
"F20"
),
var1 = c(
"X",
"Y",
"Z"
),
var2 = c(73L, 34L, 6L),
var3 = c(3560L, 3L, 71L)
),
row.names = c(
NA,
-3L
),
class = c("tbl_df", "tbl", "data.frame")
)
dft <- df %>%
adorn_totals(c("col", "row")) %>%
adorn_percentages(c("col")) %>%
adorn_pct_formatting()
formatted_ns <- attr(dft, "core") %>%
adorn_totals(c("row", "col")) %>%
mutate_if(is.numeric, format, big.mark = ",")
dft %>% adorn_ns(ns = formatted_ns)
#> term var1 var2 var3 Total
#> F20 X (X) 64.6% ( 73) 98.0% (3,560) 97.0% (3,633)
#> F20 Y (Y) 30.1% ( 34) 0.1% ( 3) 1.0% ( 37)
#> F20 Z (Z) 5.3% ( 6) 2.0% ( 71) 2.1% ( 77)
#> Total - (-) 100.0% (113) 100.0% (3,634) 100.0% (3,747)
Created on 2021-01-12 by the reprex package (v0.3.0)
Thanks for the reprex and for finding an existing issue! I think the behavior above is not about having multiple character columns, but instead about passing your custom Ns to the ns = argument. If I run dft %>% adorn_ns() in your example it works.
Now that said, I think it's still a bug. The default behavior should probably still be to operate only on numeric columns, even when custom Ns are passed, unless that's overridden by passing column names to the ... argument.
~~For now you can use the ... to workaround it but this should get fixed eventually.~~ Hm that appears not to be working.
dft %>% adorn_ns(ns = formatted_ns,,,,,,!(term:var2))
Is not skipping those variables, while without the custom Ns it does:
dft %>% adorn_ns(,,,,,,!(term:var2))
I wonder if that's a different bug. I'll know when I get under the hood.
> Hm that appears not to be working. ... wonder if that's a different bug. I'll know when I get under the hood.
I can't replicate this issue, in both cases the var2 variable gets skipped. So there's only the matter of: adorn_ns should not apply to character columns even when custom Ns are passed unless explicitly told to by the ... argument.
> Now that said, I think it's still a bug. The default behavior should still probably be to only operate on numeric columns by default, even when custom Ns are passed, unless that's overridden by passing column names to the ... argument.
Upon trying to implement that, I realize: where would the function find which columns are numeric when custom Ns are being passed? In the (excellent) example above, at the time of calling dft %>% adorn_ns(ns = formatted_ns) the %s are formatted as characters, as are the Ns. So there'd be no way for the function to know.
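A small base R illustration of that point, using made-up stand-ins for the formatted percentages and the comma-formatted custom Ns (not the actual objects from the reprex): once both have been formatted, no column in either data frame is numeric any more, so column type cannot be used to decide what to adorn.

```r
# Stand-in for the table after percentage formatting: everything is character.
pcts <- data.frame(
  term = c("F20", "F20"),
  var1 = c("X", "Y"),
  var2 = c("64.6%", "30.1%"),
  stringsAsFactors = FALSE
)

# Stand-in for custom Ns: format() with big.mark returns character vectors.
custom_ns <- data.frame(
  term = c("F20", "F20"),
  var1 = c("X", "Y"),
  var2 = format(c(73L, 3560L), big.mark = ","),
  stringsAsFactors = FALSE
)

numeric_in_pcts <- vapply(pcts, is.numeric, logical(1))
numeric_in_ns <- vapply(custom_ns, is.numeric, logical(1))

print(numeric_in_pcts)
print(numeric_in_ns)
```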
In that situation, the user must specify with ... which cols to adorn. In light of that, I'll leave the codebase as-is and close this. And if someone wants to use custom Ns in a situation like this, which is totally legit, they'll just need to specify the columns:
dft %>% adorn_ns(ns = formatted_ns,,,,,,!(term:var1))
And whew, I closed this issue a day before it turned 2 years old. I should have spun off a separate issue since it makes it look like I left the issue open for more like 5 years 😬
Having recently closed a ~7 year old feature request in my main package, I understand the feeling: 👏 !!!