FR: when ingesting data, give all columns labelled in Stata/SPSS/etc the `haven_labelled` class
Background
Currently, when `haven` ingests a Stata `.dta` file, it preserves the Stata data attributes of a column in different ways, depending on which attributes it finds:
- When a column has both variable and value labels, `haven` adds the `haven_labelled` and `vctrs_vctr` classes and stores these attributes in the `label` and `labels` attributes, respectively.
- When a column has a variable label only, `haven` does not add any classes and stores the label in the `label` attribute.
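The two storage patterns above can be reproduced without a Stata file. The sketch below builds both kinds of column by hand. It assumes that `haven::labelled()` yields the same structure `read_dta()` produces for a column with value labels, and that a plain numeric vector carrying a `label` attribute is a fair stand-in for a label-only column. The object names are illustrative only.

```r
library(haven)

# Stand-in for a column with both a variable label and value labels
# (assumption: equivalent to what read_dta() returns for such a column)
both_labels <- labelled(
  c(1, 2),
  labels = c(Yes = 1, No = 2),
  label = "Var 1"
)

# Stand-in for a column with a variable label only:
# a bare numeric vector with a "label" attribute and no extra class
label_only <- structure(c(2, 12), label = "Var 2")

# The first column carries extra classes; the second is plain numeric
class(both_labels)
class(label_only)

# Both columns still carry the variable label at this point
attr(both_labels, "label", exact = TRUE)
attr(label_only, "label", exact = TRUE)
```

The only structural difference between the two objects is the class vector, which is what the rest of this request hinges on.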
Rationale
This is all well and good: `haven` does the right thing by preserving Stata data attributes.
However, the different methods for preserving those attributes sometimes matter.
The main rationale is a desirable side effect of labelled columns having additional classes: when two data frames are combined with `purrr::list_rbind()` or `vctrs::vec_rbind()` (which `purrr::list_rbind()` calls), data attributes preserved by `haven` are kept only for columns with additional classes.
See also issues here and here.
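A file-free sketch of that side effect follows. It assumes hand-built tibbles behave like the ones `read_dta()` returns; `make_df()` is a hypothetical helper written for this illustration, not part of any package.

```r
library(haven)
library(tibble)
library(vctrs)

# Hypothetical helper: one classed column (var1) and one bare numeric
# column that carries only a "label" attribute (var2)
make_df <- function(v1, v2) {
  tibble(
    var1 = labelled(v1, labels = c(Yes = 1, No = 2), label = "Var 1"),
    var2 = structure(v2, label = "Var 2")
  )
}

df1 <- make_df(1, 2)
df2 <- make_df(2, 12)

combined <- vec_rbind(df1, df2)

# Variable label on the classed column after binding
attr(combined$var1, "label", exact = TRUE)

# Variable label on the bare numeric column after binding
attr(combined$var2, "label", exact = TRUE)
```

Comparing the two `attr()` results shows which column kept its variable label through the bind.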
How haven stores data attributes
Here are some example Stata files: examples_stata_files.zip
Here's the Stata code that generated them
* ==============================================================================
* create file 1
* ==============================================================================
* set up file
clear
set obs 1
* define data
gen var1 = 1
gen var2 = 2
gen var3 = 3
* attach variable labels for all variables
label variable var1 "Var 1"
label variable var2 "Var 2"
label variable var3 "Var 3"
* attach value labels for var 1 only
label define var1_lbl 1 "Yes" 2 "No"
label values var1 var1_lbl
save "stata_file1.dta", replace
* ==============================================================================
* create file 2
* ==============================================================================
* set up file
clear
set obs 1
* define data
gen var1 = 2
gen var2 = 12
gen var3 = 13
* attach variable labels for all variables
label variable var1 "Var 1"
label variable var2 "Var 2"
label variable var3 "Var 3"
* attach value labels for var 1 only
label define var1_lbl 1 "Yes" 2 "No"
label values var1 var1_lbl
save "stata_file2.dta", replace
Here's how `haven` captures those Stata data. Note the difference between `var1`, which has both a variable label and value labels, and `var2`/`var3`, which have only a variable label.
# load a Stata file
stata_df1 <- haven::read_dta(file = "stata_file1.dta")
# inspect its contents
str(stata_df1)
#> tibble [1 × 3] (S3: tbl_df/tbl/data.frame)
#> $ var1: dbl+lbl [1:1] 1
#> ..@ label : chr "Var 1"
#> ..@ format.stata: chr "%9.0g"
#> ..@ labels : Named num [1:2] 1 2
#> .. ..- attr(*, "names")= chr [1:2] "Yes" "No"
#> $ var2: num 2
#> ..- attr(*, "label")= chr "Var 2"
#> ..- attr(*, "format.stata")= chr "%9.0g"
#> $ var3: num 3
#> ..- attr(*, "label")= chr "Var 3"
#> ..- attr(*, "format.stata")= chr "%9.0g"
# column with variable label and value labels
class(stata_df1$var1)
#> [1] "haven_labelled" "vctrs_vctr" "double"
# column with variable label only
class(stata_df1$var2)
#> [1] "numeric"
Created on 2024-10-10 with reprex v2.1.1
How binding data frames drops attributes of columns without additional classes
# ingest two files with the same columns
stata_df1 <- haven::read_dta(file = "stata_file1.dta")
stata_df2 <- haven::read_dta(file = "stata_file2.dta")
identical(names(stata_df1), names(stata_df2))
#> [1] TRUE
# note: the contents are the same as shown above
# combine the two files
stata_combined <- purrr::list_rbind(list(stata_df1, stata_df2))
str(stata_combined)
#> tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
#> $ var1: dbl+lbl [1:2] 1, 2
#> ..@ label : chr "Var 1"
#> ..@ format.stata: chr "%9.0g"
#> ..@ labels : Named num [1:2] 1 2
#> .. ..- attr(*, "names")= chr [1:2] "Yes" "No"
#> $ var2: num [1:2] 2 12
#> $ var3: num [1:2] 3 13
# column with variable label and value labels
class(stata_combined$var1)
#> [1] "haven_labelled" "vctrs_vctr" "double"
# column with variable label only
class(stata_combined$var2)
#> [1] "numeric"
# same result with other tidyverse methods
stata_combined_dplyr <- dplyr::bind_rows(stata_df1, stata_df2)
str(stata_combined_dplyr)
#> tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
#> $ var1: dbl+lbl [1:2] 1, 2
#> ..@ label : chr "Var 1"
#> ..@ format.stata: chr "%9.0g"
#> ..@ labels : Named num [1:2] 1 2
#> .. ..- attr(*, "names")= chr [1:2] "Yes" "No"
#> $ var2: num [1:2] 2 12
#> $ var3: num [1:2] 3 13
stata_combined_vctrs <- vctrs::vec_rbind(stata_df1, stata_df2)
str(stata_combined_vctrs)
#> tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
#> $ var1: dbl+lbl [1:2] 1, 2
#> ..@ label : chr "Var 1"
#> ..@ format.stata: chr "%9.0g"
#> ..@ labels : Named num [1:2] 1 2
#> .. ..- attr(*, "names")= chr [1:2] "Yes" "No"
#> $ var2: num [1:2] 2 12
#> $ var3: num [1:2] 3 13
Created on 2024-10-10 with reprex v2.1.1
(As an aside, I noticed that this behavior does not occur when there is only 1 column with a variable label but no value labels. In that corner case, that column is given the haven class.)
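Until something like this is available at ingestion time, the requested behavior can be approximated on the user side. The sketch below is only an illustration under stated assumptions: `as_labelled_cols()` is a hypothetical helper (not a `haven` function) that wraps label-only numeric or character columns in `haven::labelled()` with `labels = NULL`, so that they gain the `haven_labelled` class. It does not carry over other attributes such as `format.stata`.

```r
library(haven)
library(tibble)
library(vctrs)

# Hypothetical helper, not part of haven: give label-only columns the
# haven_labelled class so the variable label travels with a classed vector
as_labelled_cols <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    var_label <- attr(col, "label", exact = TRUE)
    # Only touch unclassed numeric/character columns that have a variable label
    label_only <- !is.object(col) &&
      !is.null(var_label) &&
      (is.numeric(col) || is.character(col))
    if (label_only) {
      # as.vector() drops the existing attributes before re-wrapping
      df[[nm]] <- labelled(as.vector(col), labels = NULL, label = var_label)
    }
  }
  df
}

# Stand-ins for two ingested files with a label-only column
df1 <- tibble(var2 = structure(2, label = "Var 2"))
df2 <- tibble(var2 = structure(12, label = "Var 2"))

combined <- vec_rbind(as_labelled_cols(df1), as_labelled_cols(df2))

class(combined$var2)
attr(combined$var2, "label", exact = TRUE)
```

Doing the same thing inside `haven` at read time would remove the need for a post-processing step like this one.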