Exercise 9.4.1 #1 tidy the EDAWR::who dataset

Open HossamGhorab opened this issue 3 years ago • 0 comments

https://github.com/kangnade/ggplot2-solutions/blob/e6ef9e3b271599f6e97afc0f8a3f012f276f9385/ggplot2_solutions_chapter9.Rmd#L63

Install the EDAWR package as mentioned in issue #7.

To tidy the who dataset, one must deal with multiple variables stored in the column names. This is done using either the names_sep or the names_pattern argument of pivot_longer(). The latter is the suitable one for our problem, because gender and age are fused without a separator (e.g. m014). The pattern is a regular expression, which I haven't practiced. Fortunately, vignette("pivot") includes the code to tidy this dataset. I borrowed (cheated) it, and because the resulting dataset was annoyingly long, I spread the year variable to columns, changing the dimensions from 405440 rows * 8 variables for the one in the vignette to 12264 rows * 40 variables after spreading. I believe this makes the analysis easier.

Note: I suppose this step could have been included in the vignette, but it wasn't because the who dataset is used there as an example of how to use pivot_longer(), so it makes sense not to spread the years using its cousin pivot_wider().
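To see what the regular expression does before trusting it, here is a minimal sketch that applies the same pattern to three column names mentioned in this issue, using base R's regexec() with perl = TRUE. This is only an illustration: tidyr may use a different regex engine internally, so edge cases could behave differently.

```r
# Column names taken from the code and comment in this issue
cols <- c("new_sp_m014", "new_rel_f65", "newrel_f65")
pattern <- "new_?(.*)_(.)(.*)"

# regexec() returns the full match followed by one entry per capture group
m <- regmatches(cols, regexec(pattern, cols, perl = TRUE))

# Drop the full match (first element) and bind the three groups into a matrix
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("diagnosis", "gender", "age")
rownames(parts) <- cols
print(parts)
```

The optional underscore in new_? is what lets the same pattern handle both new_rel_f65 and newrel_f65; the single-character group (.) picks out the gender letter, and whatever follows is the age band.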

library(tidyverse)
library(EDAWR)
vignette("pivot") # see the example on multiple variables in column names

who_cheated <- who %>%
  pivot_longer(
    # note the underscore: new_rel_f65 here, not newrel_f65 as in the vignette
    cols = new_sp_m014:new_rel_f65,
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    names_transform = list(
      gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
      age = ~ readr::parse_factor(
        .x,
        levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
        ordered = TRUE
      )
    ),
    values_to = "count"
  )

# spread the years to columns: one column per year, filled with count
who_cheated_wider <- who_cheated %>%
  pivot_wider(names_from = year, values_from = count)
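To make the change in dimensions concrete without installing EDAWR, here is a small sketch of the same pivot_wider() call on made-up data. The data frame long and its values are hypothetical and only mimic the country / year / count layout.

```r
library(tidyr)

# Hypothetical toy data: 2 countries x 2 years, in long form
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1999, 2000, 1999, 2000),
  count   = c(10, 12, 7, 9)
)

# Same call shape as above: one new column per distinct year
wide <- pivot_wider(long, names_from = year, values_from = count)

dim(long)  # 4 rows, 3 columns
dim(wide)  # 2 rows, 3 columns: country, `1999`, `2000`
```

The row count drops by roughly the number of distinct years, while each year adds a column, which is the same trade that turns the long who table into the wider one.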

Warmly

HossamGhorab • Oct 03 '21 16:10