gssr
Wrong variable labels
Thanks for this very helpful package.
I noticed that some variable labels in gss_data are wrong (i.e. don't match the codebook) - for instance violjews is coded the wrong way round. That obviously risks leading to mistakes in analyses - should variable labels maybe be taken directly from the codebook?
Hm. The variable labels in gss_all are imported along with the values by read_dta() from the dta file provided by NORC. There isn't any manual processing of labels by me. So the issue is either at the point where the .dta file gets converted by haven or it's in the dta file itself. This seems to affect variables (like the "prone to violence" scale) where the endpoints are labeled but the intermediate points are not. But not all of them. The labels attribute appears to be coming from a different question or kind of question.
With a tmp file that's just the 1972-2021 cumulative R2 dta file:
attr(tmp$violwhts, "label")
#> [1] "violent - not violent"
attr(tmp$violwhts, "labels")
#> 1 - not at all severe      10 - very severe
#>                     1                    10
attr(tmp$violblks, "label")
#> [1] "violent - not violent"
attr(tmp$violblks, "labels")
#> 1 - not at all severe      10 - very severe
#>                     1                    10
attr(tmp$violjews, "label")
#> [1] "violent - not violent"
attr(tmp$violjews, "labels")
#> 1 - not at all severe      10 - very severe
#>                     1                    10
attr(tmp$wlthwhts, "label")
#> [1] "rich - poor"
attr(tmp$wlthwhts, "labels")
#>           1 hour 97 hours or more
#>                1               97
attr(tmp$wlthjews, "label")
#> [1] "rich - poor"
attr(tmp$wlthjews, "labels")
#>           1 hour 97 hours or more
#>                1               97
attr(tmp$patrwhts, "label")
#> [1] "unpatriotic - patriotic"
attr(tmp$patrwhts, "labels")
#> NULL
attr(tmp$influwht, "label")
#> [1] "influence of whites"
attr(tmp$influwht, "labels")
#> NULL
attr(tmp$influblk, "label")
#> [1] "influence of blacks"
attr(tmp$influblk, "labels")
#> NULL
The basic label attribute is right, but in general the labels attribute is not. I'm not sure where it's getting them from in the first place. I'll investigate this a bit further.
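For that kind of investigation, it can help to put each variable's label attribute next to its labels attribute in one table. Below is a minimal sketch in base R. The toy data frame only stands in for the imported file (its values are made up), and label_summary() is a hypothetical helper, not something provided by gssr or haven.

```r
# Toy stand-in for a data frame imported with read_dta(): numeric columns
# carrying a "label" attribute (variable label) and, sometimes, a "labels"
# attribute (value labels). The data values are invented for illustration.
toy <- data.frame(
  violjews = c(1, 4, 5, 7, NA),
  patrwhts = c(2, 3, 6, 7, NA)
)
attr(toy$violjews, "label")  <- "violent - not violent"
attr(toy$violjews, "labels") <- c("1 - not at all severe" = 1,
                                  "10 - very severe" = 10)
attr(toy$patrwhts, "label")  <- "unpatriotic - patriotic"

# Hypothetical helper: one row per variable, showing the variable label and
# a collapsed rendering of the value labels (NA when there are none).
label_summary <- function(df) {
  data.frame(
    variable = names(df),
    label = vapply(df, function(x) {
      l <- attr(x, "label")
      if (is.null(l)) NA_character_ else l
    }, character(1), USE.NAMES = FALSE),
    labels = vapply(df, function(x) {
      l <- attr(x, "labels")
      if (is.null(l)) NA_character_ else paste0(l, " = ", names(l), collapse = "; ")
    }, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

out <- label_summary(toy)
print(out)
```

Reading the label and labels columns side by side makes a pairing like "violent - not violent" with "10 - very severe" easy to spot by eye.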
OK, while my first thought was that this was a problem with read_dta(), I now believe it is an issue with the current release of the GSS data file from NORC. Some number of variables (among them, at least, the viol and wlth questions above) have been assigned the wrong labels for their scales during the process of creating the dta file. That is, e.g., for violjews the overall label attribute is correct (violent - not violent), with the implicit starting point of 1 for violent. But the labels attribute for the values of that response scale is wrong: it should not run from "1 - not at all severe" to "10 - very severe" at all. For one thing, the response scale only has a range of 1 to 7. The same problem is true, even more obviously, of the wlth series, where rich - poor is correct but the associated labels, "1 hour" and "97 hours or more", are nonsensical.
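The range argument above (labels that run to 10 on a scale that only goes to 7) suggests a rough screening heuristic: flag value labels attached to values that never occur in the data. The sketch below uses base R and a made-up vector standing in for the imported column; unused_value_labels() is a hypothetical helper. It is only a heuristic, since a label on an unobserved value (a missing-data code, say) is not necessarily wrong.

```r
# Stand-in for a column as imported from the dta file: a numeric vector with
# a "label" attribute and a "labels" attribute. Data values are invented.
violjews <- structure(
  c(1, 4, 4, 5, 6, 7, 3, 2, NA),
  label  = "violent - not violent",
  labels = c("1 - not at all severe" = 1, "10 - very severe" = 10)
)

# Hypothetical helper: return the value labels whose values never appear
# among the observed data values (an empty vector if there are no labels).
unused_value_labels <- function(x) {
  labs <- attr(x, "labels")
  if (is.null(labs)) return(numeric(0))
  labs[!labs %in% x]
}

flagged <- unused_value_labels(violjews)
print(flagged)
```

Here the label attached to 10 is flagged, because no response in the toy data goes above 7.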
The values in the gss_all data are right, and the overall label is correct. So is the summary in gss_docs, although right now this table covers 1972-2018 only, not 1972-2021; I need to update it. (This is a separate issue internal to the gssr package, not the NORC dta file, obviously.)
gss_get_marginals("violjews")
#> # A tibble: 11 × 6
#> variable percent n value label id
#> <chr> <dbl> <int> <chr> <chr> <chr>
#> 1 violjews 1.8 44 1 "VIOLENCE-PRONE" VIOLJEWS
#> 2 violjews 3.1 75 2 "" VIOLJEWS
#> 3 violjews 7.2 174 3 "" VIOLJEWS
#> 4 violjews 39.3 951 4 "" VIOLJEWS
#> 5 violjews 19 460 5 "" VIOLJEWS
#> 6 violjews 20.4 493 6 "" VIOLJEWS
#> 7 violjews 9.3 224 7 "NOT VIOLENCE-PRONE" VIOLJEWS
#> 8 violjews NA 62044 0 "IAP" VIOLJEWS
#> 9 violjews NA 307 8 "DONT KNOW" VIOLJEWS
#> 10 violjews NA 42 9 <NA> VIOLJEWS
#> 11 violjews 100 64814 <NA> "Total" VIOLJEWS
You can see that the label here runs from violent to not violent, as the overall label attribute in the dataset suggests it should. Similarly, for wlth:
attr(tmp$wlthwhts, "label")
#> [1] "rich - poor"
attr(tmp$wlthwhts, "labels")
#>           1 hour 97 hours or more
#>                1               97
# This function relies on the 1972-2018 GSS codebook
# so the numbers aren't the same, but that's irrelevant here
gss_get_marginals("wlthwhts")
#> # A tibble: 13 × 6
#> variable percent n value label id
#> <chr> <dbl> <int> <chr> <chr> <chr>
#> 1 wlthwhts 4.9 991 1 "RICH" WLTHWHTS
#> 2 wlthwhts 9 1803 2 "" WLTHWHTS
#> 3 wlthwhts 27.6 5545 3 "" WLTHWHTS
#> 4 wlthwhts 49.8 9997 4 "" WLTHWHTS
#> 5 wlthwhts 7.1 1422 5 "" WLTHWHTS
#> 6 wlthwhts 1.1 221 6 "" WLTHWHTS
#> 7 wlthwhts 0.4 80 7 "POOR" WLTHWHTS
#> 8 wlthwhts 0.1 11 98 "" WLTHWHTS
#> 9 wlthwhts 0 7 99 "" WLTHWHTS
#> 10 wlthwhts NA 44189 0 "IAP" WLTHWHTS
#> 11 wlthwhts NA 425 8 "DONT KNOW" WLTHWHTS
#> 12 wlthwhts NA 123 9 <NA> WLTHWHTS
#> 13 wlthwhts 100 64814 <NA> "Total" WLTHWHTS
Again, we can see rich - poor in the overall label, corresponding to the coding in the 1972-2018 doc file, but a nonsensical "1 hour" to "97 hours or more" in the labels attribute that labels the scale in the current 1972-2021 version of the .dta file.
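The same comparison can be lined up value by value. This is a small base R sketch under stated assumptions: the codebook data frame is a hand-typed stand-in for the labeled endpoints shown in the gss_get_marginals() output above, file_labels is a hand-typed copy of the labels attribute, and the column names codebook_label and file_label are made up for the illustration.

```r
# Labeled endpoints as they appear in the codebook-based marginals table
# (typed in by hand for this sketch).
codebook <- data.frame(
  value = c("1", "7"),
  label = c("RICH", "POOR"),
  stringsAsFactors = FALSE
)

# Value labels as stored in the dta file's "labels" attribute.
file_labels <- c("1 hour" = 1, "97 hours or more" = 97)

from_file <- data.frame(
  value      = as.character(file_labels),
  file_label = names(file_labels),
  stringsAsFactors = FALSE
)

# Full outer join on the value, so labels present on only one side show as NA.
comparison <- merge(codebook, from_file, by = "value", all = TRUE)
names(comparison)[names(comparison) == "label"] <- "codebook_label"
print(comparison)
```

Rows where one side is NA, or where the two labels plainly describe different things, are the ones worth checking.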
To confirm this is an issue in the .dta file, I took a look at it directly in Stata. If you fire up the current release of the 1972-2021 GSS in Stata you'll see the same miscoding there as well. A Stata screenshot to confirm:

I think you can see this issue in the GSS Data Explorer on the NORC website, as well (e.g. if you try to crosstabulate some of these variables the labels don't work properly).
In summary I think that, for some subset of GSS variables in the current release of the cumulative data file, something has gone wrong during the Data Dictionary process that links labels for various scales to the relevant questions using those scales. I'll try to get in touch with the people at NORC and alert them to the issue.
Some more documentation of the error just for the sake of it. Here are some sample incorrectly-labeled tabulations from the Berkeley SDA archive of the 2021 GSS Cumulative file. (https://sda.berkeley.edu/sdaweb/analysis/?dataset=gss21)

