readstata13

Support labelled vectors

Open jackobailey opened this issue 3 years ago • 7 comments

At the moment, readstata13 is the only package available that can write compressed Stata .dta files. But the package does not play nicely with labelled vectors from the tidyverse labelled package. Instead, it treats them as numeric and, thus, removes all value and variable labels when it writes them to disk.

Any chance that readstata13 could support labelled vectors?

jackobailey, Feb 13 '21 09:02

Could you give a small usage example? I have no experience with the labelled package.

sjewo, Feb 13 '21 09:02

Of course, see below.


# Load packages

library(tidyverse)
library(labelled)
library(readstata13)
library(here)


# Create data frame with labelled vectors

dta <- 
  data.frame(
    lab_vct =
      sample(c(1:3, 999), 100, replace = TRUE) %>% 
      labelled(
        labels =
          c(
            "a" = 1,
            "b" = 2,
            "c" = 3,
            "dk" = 999
          )
      )
  )


# Add variable label

var_label(dta$lab_vct) <- "Variable information here"


# Write to disk using save.dta13 (includes no variable/value labels)

save.dta13(data = dta, file = here("file.dta"), compress = TRUE)

When we open the file in Stata we see that there are no variable or value labels:

[Two screenshots of the file opened in Stata, 2021-02-13]

If we instead save with write_dta from the haven package, the labels are preserved (though unfortunately haven cannot compress the Stata file, which can yield huge file sizes for large data sets).

[Two screenshots of the haven-written file opened in Stata, 2021-02-13]
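
For comparison, here is a minimal sketch of the haven route mentioned above. It assumes the haven package is installed and writes to a temporary file; the column `id` and the small vector are made up for illustration. Reading the file back with `read_dta` is one way to check the labels without opening Stata.

```r
library(haven)

# Small labelled vector with a non-sequential "don't know" code (999)
dta <- data.frame(id = 1:5)
dta$lab_vct <- labelled(
  c(1, 2, 3, 999, 1),
  labels = c("a" = 1, "b" = 2, "c" = 3, "dk" = 999),
  label = "Variable information here"
)

# Write with haven (uncompressed) and read the file back
path <- tempfile(fileext = ".dta")
write_dta(dta, path)
back <- read_dta(path)

# Inspect the value labels and the variable label after the round trip
attr(back$lab_vct, "labels")
attr(back$lab_vct, "label")
```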

Finally, I would add that labelled vectors are often used when creating data for Stata because they let you specify exact label-value combinations (e.g. that "don't know" = 999), whereas factors number everything sequentially. They also let you add variable labels.

jackobailey, Feb 13 '21 10:02
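
The difference between factors and labelled vectors described above can be seen in base R alone. The sketch below is illustrative only: it mimics a labelled vector with a plain numeric vector carrying a `labels` attribute (an assumption made to avoid package dependencies, not a description of how readstata13 handles anything).

```r
codes <- c(a = 1, b = 2, c = 3, dk = 999)
x <- c(1, 999, 2, 3, 999)

# A factor stores positions in its level set, so "dk" becomes 4, not 999
f <- factor(x, levels = codes, labels = names(codes))
factor_codes <- as.integer(f)

# A numeric vector with a "labels" attribute keeps the original codes
lab <- structure(x, labels = codes)
labelled_codes <- as.vector(lab)

print(factor_codes)    # 1 4 2 3 4
print(labelled_codes)  # 1 999 2 3 999
```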

It might be a good idea to add some code for labelled columns. I want to add some functions for working with labels anyway.

In the meantime, you could prepare your data manually before exporting:

# Load packages

library(tidyverse)
library(labelled)
library(readstata13)
library(here)


# Create data frame with labelled vectors

dta <- 
  data.frame(
    lab_vct =
      sample(c(1:3, 999), 100, replace = TRUE) %>% 
      labelled(
        labels =
          c(
            "a" = 1,
            "b" = 2,
            "c" = 3,
            "dk" = 999
          )
      )
  )

# Add variable label

var_label(dta$lab_vct) <- "Variable information here"


# Write to disk using save.dta13 (includes no variable/value labels)

save.dta13(data = dta, file = here("file.dta"), compress = TRUE)

# Add another labelled variable
dta$num1 <- 1:100
var_label(dta$num1) <- "Numeric variable"

# Add a variable without label
dta$num2 <- 1:100

# Get vector of labelled variables
labeld_vars <- sapply(dta, is.labelled)

# And convert them to a factor
for(v in names(labeld_vars)[labeld_vars]) {
  dta[[v]] <- to_factor(dta[[v]])
}

# Get variable labels
var_labs <- var_label(dta)

# Replace missing labels with ""
var_labs[sapply(var_labs, is.null)] <- ""

# Unlist and order variable labels
var_labs <- unlist(var_labs)[names(dta)]

# Save variable labels as attribute
attr(dta, "var.labels") <- var_labs

# Save dta-file
save.dta13(data = dta, file = here("file_with_labels.dta"), compress = TRUE)

sjewo, Feb 13 '21 11:02
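
The variable-label steps in the workaround above could be wrapped in a small helper. The function name `set_var_labels_attr` is hypothetical, and the sketch assumes that each column's variable label is stored in its `label` attribute (which is where `var_label()` puts it). It uses base R only and fills missing labels with "".

```r
# Hypothetical helper: collect per-column "label" attributes into the
# data frame's "var.labels" attribute, in column order.
set_var_labels_attr <- function(df) {
  labs <- vapply(df, function(col) {
    lab <- attr(col, "label", exact = TRUE)
    # Columns without a variable label get an empty string
    if (is.null(lab)) "" else as.character(lab)[1]
  }, character(1))
  attr(df, "var.labels") <- labs[names(df)]
  df
}

# Usage: one labelled column, one unlabelled column
example_df <- data.frame(num1 = 1:3, num2 = 4:6)
attr(example_df$num1, "label") <- "Numeric variable"
example_df <- set_var_labels_attr(example_df)
print(attr(example_df, "var.labels"))
```

The resulting data frame can then be passed to `save.dta13` in the same way as in the workaround.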

Thanks. I agree that supporting labelled vectors would be useful. While I could convert the labelled vectors to factors first, that doesn't help in this case because it renumbers the values sequentially. We need, for example, "Don't know" to be 999 in all cases, which is not possible with factors because they number the value labels sequentially.

jackobailey, Feb 13 '21 11:02

I think it is also possible to keep the numeric codes. I'll look into that.

sjewo, Feb 13 '21 11:02

I did a rough implementation of labelled vector support in commit d8af2077bd8ba74c05795d2b53fb2be7bf9657bf. The code will probably change in the future, but you could give it a try:

# Install readstata13 from the label branch
remotes::install_github("sjewo/readstata13@label")

library(tidyverse)
library(labelled)
library(readstata13)
library(here)


# Create data frame with labelled vectors

dta <- 
  data.frame(
    lab_vct =
      sample(c(1:3, 999), 100, replace = TRUE) %>% 
      labelled(
        labels =
          c(
            "a" = 1,
            "b" = 2,
            "c" = 3,
            "dk" = 999
          )
      )
  )

# Add variable label
var_label(dta$lab_vct) <- "Variable information here"

# Get variable labels
var_labs <- var_label(dta)

# Replace missing labels with ""
var_labs[sapply(var_labs, is.null)] <- ""

# Unlist and order variable labels
var_labs <- unlist(var_labs)[names(dta)]

# Save variable labels as attribute
attr(dta, "var.labels") <- var_labs 

# Write to disk using save.dta13 (now includes variable and value labels)
save.dta13(data = dta, file = here("file.dta"), compress = TRUE)

sjewo, Feb 13 '21 16:02

Thanks for the package. It would be great if (missing) values such as 999 (don't know) could be recoded into Stata missing values such as .a. Ideally, the user could supply the recoding rules: 999 = .a, 998 = .b, etc.

dusadrian, Mar 23 '21 21:03
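
As a rough illustration of what user-supplied recoding rules might look like, here is a base R sketch. The helper `recode_missing` and its return format are hypothetical and are not part of readstata13. It only separates the numeric values from the intended Stata missing codes; it does not write them to a .dta file.

```r
# Hypothetical helper: apply rules such as 999 = .a, 998 = .b.
# `rules` is a named character vector: names are the numeric codes,
# values are the intended Stata extended missing codes.
recode_missing <- function(x, rules) {
  idx <- match(x, as.numeric(names(rules)))
  # Intended Stata missing code per element (NA where no rule applies)
  tags <- unname(rules[idx])
  # Replace the matched codes with NA in the numeric values
  x[!is.na(idx)] <- NA
  list(values = x, stata_missing = tags)
}

rules <- c("999" = ".a", "998" = ".b")
res <- recode_missing(c(1, 999, 2, 998), rules)
print(res$values)         # 1 NA 2 NA
print(res$stata_missing)  # NA ".a" NA ".b"
```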