ggplot2-solutions
Exercise 9.4.1 #1 tidy the EDAWR::who dataset
https://github.com/kangnade/ggplot2-solutions/blob/e6ef9e3b271599f6e97afc0f8a3f012f276f9385/ggplot2_solutions_chapter9.Rmd#L63
Install the EDAWR package as mentioned in issue #7.
To tidy the who dataset, one must deal with multiple variables encoded in the column names. This is done using either the names_sep or the names_pattern argument of pivot_longer(). The latter is the suitable one for our problem. The pattern supplied is a regular expression, which I haven't practiced.
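Since the regular expression is the part I found least obvious, here is a minimal sketch of what the vignette's pattern captures, using only base R. The three column names are sample spellings (newrel_f65 is included to show what the optional underscore in "new_?" is for), and perl = TRUE is my own assumption for the illustration, not something taken from the vignette.

```r
# Sample column names; "newrel_f65" lacks the underscore after "new".
cols <- c("new_sp_m014", "new_ep_f65", "newrel_f65")

# Pattern copied from vignette("pivot"): three capture groups for
# diagnosis, gender (a single character), and age band.
pattern <- "new_?(.*)_(.)(.*)"

# regexec() gives the full match plus one entry per capture group.
# perl = TRUE is an assumption made here for the illustration.
parts <- regmatches(cols, regexec(pattern, cols, perl = TRUE))

# Drop the full match (first element) and label the three groups.
result <- do.call(rbind, lapply(parts, function(p) p[-1]))
colnames(result) <- c("diagnosis", "gender", "age")
rownames(result) <- cols

print(result)
```

Each row of result shows the three pieces that names_to = c("diagnosis", "gender", "age") would receive for that column name.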
Fortunately, vignette("pivot") includes the code to tidy this dataset. I borrowed (cheated) it, and because the resulting dataset was annoyingly long, I spread the year variable to columns. This changes the dimensions from 405440 rows * 8 variables in the vignette's version to 12264 rows * 40 variables after spreading the years. I believe this makes the analysis easier.
Note: I suppose this step could have been included in the vignette, but it wasn't because the who dataset serves there as an example of how to use pivot_longer(), so it makes sense not to spread the years using its cousin pivot_wider().
library(tidyverse)
library(EDAWR) # loaded last, so `who` refers to EDAWR::who

vignette("pivot") # see the example on "Multiple variables in column names"

# Split each column name into diagnosis, gender and age, then stack the counts.
who_cheated <- who %>%
  pivot_longer(
    cols = new_sp_m014:new_rel_f65, # the vignette writes newrel_f65; the underscore is added here
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    names_transform = list(
      gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
      age = ~ readr::parse_factor(
        .x,
        levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
        ordered = TRUE
      )
    ),
    values_to = "count"
  )

# Spread the years into one column per year.
who_cheated_wider <- who_cheated %>%
  pivot_wider(names_from = year, values_from = count)
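To illustrate why spreading the years shrinks the row count, here is a small sketch on made-up data (the tibble below is hypothetical and not taken from who). It assumes only that tidyr is installed.

```r
library(tidyr)

# Hypothetical toy data: two countries, two years, one count per pair.
long <- tibble::tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  count   = c(1, 2, 3, 4)
)

# One column per year; rows collapse to one per country.
wide <- pivot_wider(long, names_from = year, values_from = count)

print(dim(long))
print(dim(wide))
print(names(wide))
```

The same thing happens in who_cheated_wider: every distinct year becomes a column, so the rows are divided by the number of years while the column count grows by that number minus the two columns (year and count) that were consumed.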
Warmly