janitor adorn_ns captures character columns from tibble

Open jzadra opened this issue 6 years ago • 10 comments

adorn_ns is modifying character columns from a tibble. Note that this does not occur from adorn_percentages or adorn_pct_formatting.

require(janitor)
#> Loading required package: janitor
require(tidyverse)
#> Loading required package: tidyverse

dat <- structure(
  list(
    race = c(
      "American Indian And Alaska Native Alone", "American Indian And Alaska Native Alone",
      "Asian Alone", "Asian Alone",
      "Black Or African American Alone", "Black Or African American Alone",
      "Hispanic Or Latino", "Hispanic Or Latino",
      "Native Hawaiian And Other Pacific Islander Alone", "Native Hawaiian And Other Pacific Islander Alone",
      "Some Other Race Alone", "Some Other Race Alone",
      "Two Or More Races", "Two Or More Races",
      "White Alone", "White Alone",
      "White Alone, Not Hispanic Or Latino", "White Alone, Not Hispanic Or Latino"
    ),
    age = c(
      "All Ages", "Children age 0-5", "All Ages", "Children age 0-5", "All Ages", "Children age 0-5",
      "All Ages", "Children age 0-5", "All Ages", "Children age 0-5", "All Ages", "Children age 0-5",
      "All Ages", "Children age 0-5", "All Ages", "Children age 0-5", "All Ages", "Children age 0-5"
    ),
    `Above FPL` = c(
      1821, 80, 38093, 1716, 123148, 6695, 80869, 5794, 84,
      0, 28696, 1369, 23310, 2583, 269859, 12799, 236366, 10297
    ),
    `Below FPL` = c(
      676, 42, 17121, 521, 40132, 3904, 39454, 3692, 13,
      0, 15252, 1181, 7487, 937, 49165, 2398, 34717, 1067
    )
  ),
  .Names = c("race", "age", "Above FPL", "Below FPL"),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -18L)
)

dat %>% adorn_percentages(denominator = "row") %>% adorn_pct_formatting() %>% adorn_ns()
#>                                              race
#>           American Indian And Alaska Native Alone
#>           American Indian And Alaska Native Alone
#>                                       Asian Alone
#>                                       Asian Alone
#>                   Black Or African American Alone
#>                   Black Or African American Alone
#>                                Hispanic Or Latino
#>                                Hispanic Or Latino
#>  Native Hawaiian And Other Pacific Islander Alone
#>  Native Hawaiian And Other Pacific Islander Alone
#>                             Some Other Race Alone
#>                             Some Other Race Alone
#>                                 Two Or More Races
#>                                 Two Or More Races
#>                                       White Alone
#>                                       White Alone
#>               White Alone, Not Hispanic Or Latino
#>               White Alone, Not Hispanic Or Latino
#>                                  age      Above FPL     Below FPL
#>          All Ages         (All Ages) 72.9%   (1821) 27.1%   (676)
#>  Children age 0-5 (Children age 0-5) 65.6%     (80) 34.4%    (42)
#>          All Ages         (All Ages) 69.0%  (38093) 31.0% (17121)
#>  Children age 0-5 (Children age 0-5) 76.7%   (1716) 23.3%   (521)
#>          All Ages         (All Ages) 75.4% (123148) 24.6% (40132)
#>  Children age 0-5 (Children age 0-5) 63.2%   (6695) 36.8%  (3904)
#>          All Ages         (All Ages) 67.2%  (80869) 32.8% (39454)
#>  Children age 0-5 (Children age 0-5) 61.1%   (5794) 38.9%  (3692)
#>          All Ages         (All Ages) 86.6%     (84) 13.4%    (13)
#>  Children age 0-5 (Children age 0-5)     -      (0)     -     (0)
#>          All Ages         (All Ages) 65.3%  (28696) 34.7% (15252)
#>  Children age 0-5 (Children age 0-5) 53.7%   (1369) 46.3%  (1181)
#>          All Ages         (All Ages) 75.7%  (23310) 24.3%  (7487)
#>  Children age 0-5 (Children age 0-5) 73.4%   (2583) 26.6%   (937)
#>          All Ages         (All Ages) 84.6% (269859) 15.4% (49165)
#>  Children age 0-5 (Children age 0-5) 84.2%  (12799) 15.8%  (2398)
#>          All Ages         (All Ages) 87.2% (236366) 12.8% (34717)
#>  Children age 0-5 (Children age 0-5) 90.6%  (10297)  9.4%  (1067)

Created on 2018-04-16 by the reprex package (v0.2.0).

Apr 16 '18 21:04 jzadra

This is a great bug report, thank you! 👍

I don't think this use case ever occurred to me, I only thought of adorn_ns() being called on data.frames that are the result of a call to tabyl() which won't have multiple character columns like your dat. But your usage of these functions looks good and I agree that only numeric columns from the original tabyl should get pasted as Ns.

I should be able to implement this, but it may take me a little while.
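The rule described above (paste Ns only onto columns whose underlying counts are numeric) can be sketched in base R. This is only an illustration under stated assumptions: `append_numeric_ns()` is a hypothetical helper, not a janitor function, and the small data frames are made up for the example.

```r
# Hypothetical helper, NOT part of janitor: paste counts onto formatted
# percentages, but only for columns that are numeric in `counts`.
append_numeric_ns <- function(formatted, counts) {
  stopifnot(identical(names(formatted), names(counts)))
  out <- formatted
  for (col in names(counts)) {
    if (is.numeric(counts[[col]])) {
      out[[col]] <- paste0(formatted[[col]], " (", counts[[col]], ")")
    }
  }
  out
}

# Made-up underlying counts with two character columns
counts <- data.frame(
  race = c("Asian Alone", "Asian Alone"),
  age = c("All Ages", "Children age 0-5"),
  above = c(38093, 1716),
  stringsAsFactors = FALSE
)

# The same table after percentages have been formatted as strings
formatted <- counts
formatted$above <- c("69.0%", "76.7%")

result <- append_numeric_ns(formatted, counts)
print(result)
```

With this rule the `age` column passes through untouched, because it is character in the underlying counts.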

Apr 17 '18 01:04 sfirke

Thanks!

One thought I had for an intermediate fix, which I noticed someone else mentioned in another comment, would be to allow specification of the columns using ... in the adorn_* functions.

I see how that could cause a lot of added complexity when doing _totals though...

I'm a huge fan of janitor btw!

Apr 17 '18 20:04 jzadra

I was just working on this and thought I had the quick fix above. But the fact that adorn_ns takes a custom data.frame of ns complicates things. Say your custom Ns are character values, like "3.4M" - we do want them appended, but restricting to only numeric columns makes that fail.

I'm unsure how to accommodate those custom Ns and also accomplish the behavior in your example above. My initial idea involves treating automatic Ns (like in your example) differently than custom provided Ns. Which I don't love. It would be:

Automatic/"core" Ns: only append them to columns where those Ns are numeric, solving the problem here
Custom Ns: specify which columns to append to.

In the latter case, I'm not sure whether the user should control which columns of custom Ns get appended by supplying columns of Ns where all values are NA, as an indicator that those columns should be skipped, or through a new argument cols_to_modify. I lean toward the columns-of-NAs.

Would love to hear what people think, the design aspect of this is harder than the back end coding.
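To make the two options for custom Ns concrete, here is a rough base R sketch. Everything in it is an assumption for illustration: `append_custom_ns()` is a hypothetical function, `cols_to_modify` is the argument name floated above rather than anything that exists in janitor, and the data is invented.

```r
# Hypothetical sketch, NOT janitor's implementation.
# If `cols_to_modify` is NULL, fall back to the "columns of NAs" convention:
# any column of `ns` that is entirely NA is skipped.
append_custom_ns <- function(formatted, ns, cols_to_modify = NULL) {
  if (is.null(cols_to_modify)) {
    all_na <- vapply(ns, function(x) all(is.na(x)), logical(1))
    cols_to_modify <- names(ns)[!all_na]
  }
  out <- formatted
  for (col in cols_to_modify) {
    out[[col]] <- paste0(formatted[[col]], " (", ns[[col]], ")")
  }
  out
}

formatted <- data.frame(
  group = c("a", "b"),
  region = c("north", "south"),
  pct = c("60.0%", "40.0%"),
  stringsAsFactors = FALSE
)

# Option 1: custom Ns mark the columns to skip with all-NA values
ns_with_nas <- data.frame(
  group = c(NA, NA),
  region = c(NA, NA),
  pct = c("3.4M", "2.3M"),
  stringsAsFactors = FALSE
)
by_na <- append_custom_ns(formatted, ns_with_nas)

# Option 2: custom Ns are fully populated and the user names the columns
ns_full <- formatted
ns_full$pct <- c("3.4M", "2.3M")
by_arg <- append_custom_ns(formatted, ns_full, cols_to_modify = "pct")

print(by_na)
print(by_arg)
```

Both calls give the same result here; the difference is only in where the user expresses the choice, inside the Ns data.frame or in the function call.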

Jul 06 '18 20:07 sfirke

Hello, I was curious if there were any updates to this bug? I just experienced it myself after creating a three way table using count( ). My first 2 columns are character columns and I am also getting the result where adorn_ns("front") is duplicating the string in my second column within a set of parentheses.

As for your request for feedback: personally, I've only ever used it in situations where the Automatic/"core" option would be sufficient. Would you be able to implement an option that does both, say a cols_to_modify argument that defaults to all numeric columns but also accepts columns selected by the user?

Jan 29 '19 19:01 c-custer

I came across this issue today and was looking for a solution. I think this is probably not reproducible for the package purposes but for those who find this issue one solution is:

library(dplyr)
df %>% mutate(character_col = sapply(strsplit(character_col, "\\s\\("), `[`, 1))

In this mutate I am splitting the adorned character_col on " (" and then extracting the first chunk from each element of the resulting list.
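For anyone wanting to check the split-and-take-first idea in isolation, here is a small base R sketch on invented values (the strings mimic the adorned output from the original report). It uses `vapply()` instead of `sapply()` only to make the return type explicit.

```r
# Made-up adorned values of the form "value (value)"
adorned <- c("All Ages (All Ages)", "Children age 0-5 (Children age 0-5)")

# Split on whitespace followed by "(" and keep the first piece of each element
cleaned <- vapply(strsplit(adorned, "\\s\\("), `[`, character(1), 1)
print(cleaned)
```

One caveat: a value that legitimately contains " (" would also be cut at that point, so this only suits columns whose original values have no parenthesised text.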

Mar 03 '20 11:03 jasonpott

I need to go look at the code but it occurred to me today that it shouldn't be too hard to, at the end of this function, replace any columns with alphabetical characters, or columns in the core that are character, with their original values (that is, if a custom value of core is not specified as that could have character values per my comment https://github.com/sfirke/janitor/issues/195#issuecomment-403140569). Quite possible I'm not thinking of something, but that would be fairly simple from an implementation POV.
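A rough sketch of that "restore at the end" idea, in base R. `restore_character_cols()` is a hypothetical helper and the two data frames are invented stand-ins for the core and the adorned result; this is not how janitor is implemented.

```r
# Hypothetical sketch: once Ns have been pasted on, put back the original
# values of every column that is character in the core.
restore_character_cols <- function(adorned, core) {
  char_cols <- names(core)[vapply(core, is.character, logical(1))]
  adorned[char_cols] <- core[char_cols]
  adorned
}

# Invented core (underlying counts) with two character columns
core <- data.frame(
  gender = c("feminine", "masculine"),
  homeworld = c("Alderaan", "Tatooine"),
  n = c(1L, 8L),
  stringsAsFactors = FALSE
)

# Invented adorned output showing the bug: homeworld appended to itself
adorned <- data.frame(
  gender = c("feminine", "masculine"),
  homeworld = c("Alderaan (Alderaan)", "Tatooine (Tatooine)"),
  n = c("11.1% (1)", "88.9% (8)"),
  stringsAsFactors = FALSE
)

fixed <- restore_character_cols(adorned, core)
print(fixed)
```

As noted, this would only be safe when the default core is used, since a custom set of Ns may contain character values on purpose.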

Mar 03 '20 15:03 sfirke

(the long term fix is to add tidyverse-style column selection to the adorn_ functions, as discussed in #219)

Mar 03 '20 15:03 sfirke

@c-custer thanks for mentioning your use case with count(), that led me to this minimal example to solve:

library(dplyr)
library(janitor)

starwars %>%
  count(gender, homeworld) %>%
  slice(1:5) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting() %>%
  adorn_ns()

There the homeworld column gets inappropriately appended to itself.

Mar 03 '20 17:03 sfirke

Just chiming in--looks like the issue persists if you have more than one character column (see 'var1' in the reprex below). However, if I remove the first 'term' variable, 'var1' is appropriately ignored.

library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test
suppressPackageStartupMessages(library(dplyr))

df <- structure(
  list(
    term = c(
      "F20",
      "F20",
      "F20"
    ),
    var1 = c(
      "X",
      "Y",
      "Z"
    ),
    var2 = c(73L, 34L, 6L),
    var3 = c(3560L, 3L, 71L)
  ),
  row.names = c(
    NA,
    -3L
  ),
  class = c("tbl_df", "tbl", "data.frame")
)

dft <- df %>% 
  adorn_totals(c("col", "row")) %>% 
  adorn_percentages(c("col")) %>% 
  adorn_pct_formatting()

formatted_ns <- attr(dft, "core") %>%
  adorn_totals(c("row", "col")) %>% 
  mutate_if(is.numeric, format, big.mark = ",")

dft %>% adorn_ns(ns = formatted_ns)
#>   term  var1         var2           var3          Total
#>    F20 X (X)  64.6% ( 73)  98.0% (3,560)  97.0% (3,633)
#>    F20 Y (Y)  30.1% ( 34)   0.1% (    3)   1.0% (   37)
#>    F20 Z (Z)   5.3% (  6)   2.0% (   71)   2.1% (   77)
#>  Total - (-) 100.0% (113) 100.0% (3,634) 100.0% (3,747)

Created on 2021-01-12 by the reprex package (v0.3.0)

Jan 13 '21 01:01 daranzolin

Thanks for the reprex and for finding an existing issue! I think the behavior above is not about having multiple character columns, but instead about passing your custom Ns to the ns = argument. If I run dft %>% adorn_ns() in your example it works.

Now that said, I think it's still a bug. The default behavior should still probably be to only operate on numeric columns by default, even when custom Ns are passed, unless that's overridden by passing column names to the ... argument.

~~For now you can use the ... to workaround it but this should get fixed eventually.~~ Hm that appears not to be working.

dft %>% adorn_ns(ns = formatted_ns,,,,,,!(term:var2))

Is not skipping those variables, while without the custom Ns it does:

dft %>% adorn_ns(,,,,,,!(term:var2))

I wonder if that's a different bug. I'll know when I get under the hood.

Jan 13 '21 02:01 sfirke

> Hm that appears not to be working. ... wonder if that's a different bug. I'll know when I get under the hood.

I can't replicate this issue, in both cases the var2 variable gets skipped. So there's only the matter of: adorn_ns should not apply to character columns even when custom Ns are passed unless explicitly told to by the ... argument.

Jan 12 '23 02:01 sfirke

> Now that said, I think it's still a bug. The default behavior should still probably be to only operate on numeric columns by default, even when custom Ns are passed, unless that's overridden by passing column names to the ... argument.

Upon trying to implement that, I realize: where would the function find which columns are numeric when custom Ns are being passed? In the (excellent) example above, at the time of calling dft %>% adorn_ns(ns = formatted_ns) the %s are formatted as characters, as are the Ns. So there'd be no way for the function to know.
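A small base R illustration of that point, using invented values shaped like the example above: once percentages are formatted and the custom Ns have been passed through `format()`, every column in both data frames is character, so column type can no longer separate label columns from count columns.

```r
# Invented stand-in for the formatted percentages
formatted_pcts <- data.frame(
  term = c("F20", "F20"),
  var1 = c("X", "Y"),
  var2 = c("68.2%", "31.8%"),
  stringsAsFactors = FALSE
)

# Invented stand-in for custom Ns after format(); format() returns character
custom_ns <- data.frame(
  term = c("F20", "F20"),
  var1 = c("X", "Y"),
  var2 = format(c(73L, 34L), big.mark = ","),
  stringsAsFactors = FALSE
)

pct_is_char <- vapply(formatted_pcts, is.character, logical(1))
ns_is_char <- vapply(custom_ns, is.character, logical(1))
print(pct_is_char)
print(ns_is_char)
```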

In that situation, the user must specify with ... which cols to adorn. In light of that, I'll leave the codebase as-is and close this. And if someone wants to use custom Ns in a situation like this, which is totally legit, they'll just need to specify the columns:

dft %>% adorn_ns(ns = formatted_ns,,,,,,!(term:var1))

Jan 12 '23 02:01 sfirke

And whew, I closed this issue a day before it turned 2 years old. I should have spun off a separate issue since it makes it look like I left the issue open for more like 5 years 😬

Jan 12 '23 02:01 sfirke

Having recently closed a ~7 year old feature request in my main package, I understand the feeling: 👏 !!!

Jan 12 '23 03:01 billdenney
