Allowing two kinds of missing values (kept and not in regressions) and handling labelled variables both as numeric and character

Open bixiou opened this issue 10 months ago • 2 comments

This is a feature request, cf. here.

bixiou avatar Feb 28 '25 17:02 bixiou

Hi @bixiou, can you please explain in detail what feature you are requesting, including why it would be helpful and examples of the desired behaviour? This should all be included in the issue rather than linking to external discussion.

Thanks!

gorcha avatar Mar 02 '25 07:03 gorcha

Sure, let me explain.

Analyzing surveys, I'd like to allow treating some missing values as special but not as NA. For example, take a variable "vote" taking values "Left", "Center/Right", "Far right" and (the missing value:) "PNR" (People Not Responding). NA would be reserved for a lack of answer (e.g. the question was not asked in this survey branch) rather than for a PNR answer. Below is an example where missing values could be treated differently, in a plot of answers:

[Image: plot of answers in which missing values are treated differently]

I also have to deal with data that is sometimes best handled as numeric, sometimes as character. For example, take a 5-point Likert scale variable (named "scale") from "Strongly oppose" to "Strongly agree", which we can recode from -2 to +2. It's handier to write the condition "scale > 0" than "scale %in% c('Somewhat agree', 'Strongly agree')", especially given that the scale labels can change depending on the question. At the same time, I'd like both ways of writing this condition to work.
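To make the two ways of writing the condition concrete, here is a minimal base R sketch. It is not haven's API: the label table, the intermediate labels ("Somewhat oppose", "Indifferent") and the sample data are assumptions made up for illustration.

```r
# Hypothetical label table: names are the labels, values are the numeric codes.
scale_labels <- c("Strongly oppose" = -2, "Somewhat oppose" = -1,
                  "Indifferent" = 0,
                  "Somewhat agree" = 1, "Strongly agree" = 2)

# Made-up answers, stored as numeric codes.
scale_codes <- c(2, -1, 0, 1)

# Character view: look up the label that matches each code.
scale_chr <- names(scale_labels)[match(scale_codes, scale_labels)]

# The same condition written numerically and with labels.
agree_num <- scale_codes > 0
agree_chr <- scale_chr %in% c("Somewhat agree", "Strongly agree")

print(agree_num)
print(agree_chr)
```

With plain vectors the two views have to be kept in sync by hand; the request is for a single object on which both conditions work directly.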

Finally, in regressions, I would like to keep missing values by default. For example, lm((scale > 0) ~ vote, data = df) should include vote: PNR as a category. If a labelled variable like "vote" has numerical values attached (say from -1 for Left to +1 for Far right), I think it should be treated as categorical by default in regressions.
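As a rough illustration of that regression behaviour, the sketch below does the conversion by hand in base R: a numerically coded "vote" is turned into a factor so that PNR becomes its own category. The code 9 for PNR, the column names and the data are all assumptions for the example.

```r
# Hypothetical codes: -1 = Left, 0 = Center/Right, 1 = Far right, 9 = PNR (assumed).
vote_labels <- c("Left" = -1, "Center/Right" = 0, "Far right" = 1, "PNR" = 9)

# Made-up data; agree stands in for (scale > 0), coded 0/1.
df <- data.frame(
  agree = c(1, 0, 1, 1, 0, 0, 1, 0),
  vote  = c(-1, -1, 0, 1, 9, 9, 1, 0)
)

# Convert the codes to a factor so lm() builds one dummy per category,
# PNR included, instead of fitting a slope on the numeric codes.
df$vote_f <- factor(df$vote,
                    levels = unname(vote_labels),
                    labels = names(vote_labels))

fit <- lm(agree ~ vote_f, data = df)
coef_names <- names(coef(fit))
fit_rank <- fit$rank

print(coef_names)
print(fit_rank)
```

A true NA in vote would still be dropped by lm()'s default na.action, which matches the distinction drawn above between NA and PNR.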

Ideally, missing values should be treated as a character while not preventing meaningful numerical comparison. For example, if there is the missing value "PNR" in scale[1], we'd ideally have both scale[1] < 0 and scale[1] >= 0 returning FALSE. However, I suspect this is not possible. We'd then have to assign a numerical value to missing values (e.g. -0.1) and be careful when manipulating conditions, e.g. remember to use scale < 0 & !is.missing(scale) instead of scale < 0.
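The workaround described above could look roughly like this in base R. The helper is_missing() is a hypothetical stand-in for the is.missing() mentioned in the text, and -0.1 is the assumed code for PNR.

```r
# Assumed set of numeric codes that stand for "missing but not NA" (here: PNR).
missing_codes <- c(-0.1)

# Hypothetical helper: TRUE for NA and for any declared missing code.
is_missing <- function(x, codes = missing_codes) {
  is.na(x) | x %in% codes
}

# Made-up answers: PNR, "Strongly oppose", "Somewhat agree", not asked.
scale_codes <- c(-0.1, -2, 1, NA)

# Naive comparison: PNR wrongly counts as "below zero" and NA propagates.
naive <- scale_codes < 0

# Guarded comparison: missing values of either kind yield FALSE.
guarded <- scale_codes < 0 & !is_missing(scale_codes)

print(naive)
print(guarded)
```

The guard has to be repeated in every condition, which is the bookkeeping the request hopes to avoid.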

To sum up, here are the behaviors I'd wish:

test <- c(1, NA, -1)
test <- ideal_label(test, 
                    labels = structure(c(0, 1, -1), names = c("No", "Yes", "PNR")), 
                    missing.values = c(NA, -1))
  
as.character(test) # "Yes" NA "PNR"
as.numeric(test) # 1 NA -1 (or NaN or NA for test[3])
test %in% 1 # TRUE FALSE FALSE
test == 1 # TRUE NA FALSE
test < 1 # FALSE NA TRUE (or NA for test[3])
test %in% "Yes" # TRUE FALSE FALSE
test == "Yes" # TRUE NA FALSE
is.na(test) # FALSE TRUE FALSE
is.missing(test) # FALSE TRUE TRUE 
lm(c(T, T, T) ~ test)$rank # 2 (i.e., keeps missing values that are not NA)
df <- data.frame(test = test, true = c(T, T, T))
lm(true ~ test, data = df)$rank

The package memisc respects almost all of these behaviors, but not the last one. It would be great to find a package that respects all of them.

bixiou avatar Mar 02 '25 15:03 bixiou

Hi @bixiou,

This is well out of the scope of haven. This package is intended as a tool to load files from statistical packages, and it brings in related metadata as part of this. There are some related missing value features that provide a simple mimicking of the packages we're reading files from, but this request is a very specific use case that isn't really in our realm.

Best of luck finding something!

gorcha avatar Nov 23 '25 14:11 gorcha