
Wrong variable labels

Open LukasWallrich opened this issue 2 years ago • 3 comments

Thanks for this very helpful package.

I noticed that some variable labels in gss_data are wrong (i.e. don't match the codebook) - for instance violjews is coded the wrong way round. That obviously risks leading to mistakes in analyses - should variable labels maybe be taken directly from the codebook?

LukasWallrich avatar Aug 23 '22 14:08 LukasWallrich

Hm. The variable labels in gss_all are imported along with the values by read_dta() from the dta file provided by NORC. There isn't any manual processing of labels by me. So the issue is either at the point where the .dta file gets converted by haven or it's in the dta file itself. This seems to affect variables—like the "prone to violence" scale—where the endpoints are labeled but the intermediate points are not. But not all of them. The labels attribute appears to be coming from a different question or kind of question.

With a tmp file that's just the 1972-2021 cumulative R2 dta file,

attr(tmp$violwhts, "label")
#> [1] "violent - not violent"
attr(tmp$violwhts, "labels")
#> 1 - not at all severe      10 - very severe 
#>                     1                    10

attr(tmp$violblks, "label")
#> [1] "violent - not violent"
attr(tmp$violblks, "labels")
#> 1 - not at all severe      10 - very severe 
#>                     1                    10

attr(tmp$violjews, "label")
#> [1] "violent - not violent"
attr(tmp$violjews, "labels")
#> 1 - not at all severe      10 - very severe 
#>                     1                    10


attr(tmp$wlthwhts, "label")
#> [1] "rich - poor"
attr(tmp$wlthwhts, "labels")
#>           1 hour 97 hours or more 
#>                1               97

attr(tmp$wlthjews, "label")
#> [1] "rich - poor"
attr(tmp$wlthjews, "labels")
#>           1 hour 97 hours or more 
#>                1               97

attr(tmp$patrwhts, "label")
#> [1] "unpatriotic - patriotic"
attr(tmp$patrwhts, "labels")
#> NULL

attr(tmp$influwht, "label")
#> [1] "influence of whites"
attr(tmp$influwht, "labels")
#> NULL

attr(tmp$influblk, "label")
#> [1] "influence of blacks"
attr(tmp$influblk, "labels")
#> NULL

The basic label attribute (the variable label) is right, but the labels attribute (the value labels) in general is not. I'm not sure where it's getting them from in the first place. I'll investigate this a bit further.

kjhealy avatar Aug 23 '22 18:08 kjhealy

OK, while my first thought was that this was a problem with read_dta(), I now believe it is an issue with the current release of the GSS data file from NORC. Some number of variables (amongst them, at least the viol and wlth questions above) have been assigned the wrong labels for their scales during the process of creating the dta file. That is, e.g., for violjews the overall label attribute is correct ("violent - not violent"), with the implicit starting point of 1 for violent. But the labels attribute for the values of that response scale is wrong: it should not run from "1 - not at all severe" to "10 - very severe" at all. For one thing, the response scale only has a range of 1 to 7. The same problem, even more obviously, is true of the wlth series, where "rich - poor" is correct but the associated labels "1 hour" and "97 hours or more" are nonsensical.
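For anyone who wants to check a single variable on their own copy, here is a minimal sketch of the range argument above: it asks whether any labelled code lies outside the values actually observed. The function name `labels_outside_range` and the toy vectors are made up for illustration; the toy only mimics the attributes shown for tmp$violjews and is not the real data.

```r
# TRUE when some labelled code falls outside the range of observed values.
# `x` is any vector carrying a "labels" attribute, as imported columns do.
labels_outside_range <- function(x) {
  labs <- attr(x, "labels")
  labs <- labs[!is.na(labs)]        # ignore labels attached to missing codes
  vals <- as.vector(x[!is.na(x)])   # observed, non-missing values
  if (length(labs) == 0 || length(vals) == 0) return(FALSE)
  any(labs < min(vals) | labs > max(vals))
}

# Hypothetical stand-in for tmp$violjews: a 1-7 scale with the wrong labels
viol_toy <- structure(
  c(1, 4, 7, 5, 4, NA),
  label = "violent - not violent",
  labels = c("1 - not at all severe" = 1, "10 - very severe" = 10)
)

# Hypothetical correctly labelled version of the same scale
ok_toy <- structure(
  c(1, 4, 7, 5, 4, NA),
  label = "violent - not violent",
  labels = c("VIOLENCE-PRONE" = 1, "NOT VIOLENCE-PRONE" = 7)
)

labels_outside_range(viol_toy)
#> [1] TRUE
labels_outside_range(ok_toy)
#> [1] FALSE
```

This is only a heuristic: a label can be wrong while still falling inside the observed range, so a FALSE here does not prove the labels are right.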

The values in the gss_all data are right, and the overall label is correct. So is the summary in gss_docs, although right now this table covers 1972-2018 only, not 1972-2021; I need to update it. (This is a separate issue internal to the gssr package, not the NORC dta file, obviously.)

gss_get_marginals("violjews") 
#> # A tibble: 11 × 6
#>    variable percent     n value label                id      
#>    <chr>      <dbl> <int> <chr> <chr>                <chr>   
#>  1 violjews     1.8    44 1     "VIOLENCE-PRONE"     VIOLJEWS
#>  2 violjews     3.1    75 2     ""                   VIOLJEWS
#>  3 violjews     7.2   174 3     ""                   VIOLJEWS
#>  4 violjews    39.3   951 4     ""                   VIOLJEWS
#>  5 violjews    19     460 5     ""                   VIOLJEWS
#>  6 violjews    20.4   493 6     ""                   VIOLJEWS
#>  7 violjews     9.3   224 7     "NOT VIOLENCE-PRONE" VIOLJEWS
#>  8 violjews    NA   62044 0     "IAP"                VIOLJEWS
#>  9 violjews    NA     307 8     "DONT KNOW"          VIOLJEWS
#> 10 violjews    NA      42 9      <NA>                VIOLJEWS
#> 11 violjews   100   64814 <NA>  "Total"              VIOLJEWS

You can see that the label here runs from violent to not violent, as the overall label attribute in the dataset suggests it should. Similarly, for wlth:

attr(tmp$wlthwhts, "label")
#> [1] "rich - poor"
attr(tmp$wlthwhts, "labels")
#>           1 hour 97 hours or more 
#>                1               97

# This function relies on the 1972-2018 GSS codebook 
# so the numbers aren't the same, but that's irrelevant here
gss_get_marginals("wlthwhts") 
#> # A tibble: 13 × 6
#>    variable percent     n value label       id      
#>    <chr>      <dbl> <int> <chr> <chr>       <chr>   
#>  1 wlthwhts     4.9   991 1     "RICH"      WLTHWHTS
#>  2 wlthwhts     9    1803 2     ""          WLTHWHTS
#>  3 wlthwhts    27.6  5545 3     ""          WLTHWHTS
#>  4 wlthwhts    49.8  9997 4     ""          WLTHWHTS
#>  5 wlthwhts     7.1  1422 5     ""          WLTHWHTS
#>  6 wlthwhts     1.1   221 6     ""          WLTHWHTS
#>  7 wlthwhts     0.4    80 7     "POOR"      WLTHWHTS
#>  8 wlthwhts     0.1    11 98    ""          WLTHWHTS
#>  9 wlthwhts     0       7 99    ""          WLTHWHTS
#> 10 wlthwhts    NA   44189 0     "IAP"       WLTHWHTS
#> 11 wlthwhts    NA     425 8     "DONT KNOW" WLTHWHTS
#> 12 wlthwhts    NA     123 9      <NA>       WLTHWHTS
#> 13 wlthwhts   100   64814 <NA>  "Total"     WLTHWHTS

Again, we can see "rich - poor" in the overall label, corresponding to the coding in the 1972-2018 doc file, but a nonsensical "1 hour" and "97 hours or more" in the labels attribute that labels the scale in the current 1972-2021 version of the .dta file.
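Until the file is fixed, one possible local workaround, along the lines of the suggestion to take labels from the codebook, is to rebuild the value labels from the marginals table. This is a rough sketch and not something gssr does: `labels_from_marginals` is a hypothetical helper, and `marg` is a hand-typed stand-in for the value and label columns of the gss_get_marginals("violjews") output above (substantive scale points only).

```r
# Stand-in for two columns of gss_get_marginals("violjews"), typed by hand
marg <- data.frame(
  value = c("1", "2", "3", "4", "5", "6", "7"),
  label = c("VIOLENCE-PRONE", "", "", "", "", "", "NOT VIOLENCE-PRONE"),
  stringsAsFactors = FALSE
)

# Build a named vector of value labels from the non-empty codebook labels.
labels_from_marginals <- function(marg) {
  keep <- !is.na(marg$label) & nzchar(marg$label)
  stats::setNames(as.numeric(marg$value[keep]), marg$label[keep])
}

# Toy vector mimicking the mislabelled column (not the real data)
x <- structure(
  c(1, 4, 7, 5),
  label = "violent - not violent",
  labels = c("1 - not at all severe" = 1, "10 - very severe" = 10)
)

# Overwrite the wrong value labels; the variable label is left untouched
attr(x, "labels") <- labels_from_marginals(marg)
attr(x, "labels")
#>     VIOLENCE-PRONE NOT VIOLENCE-PRONE 
#>                  1                  7
```

Setting the attribute directly is the simplest thing that works on a plain vector; for columns imported by haven it may be cleaner to rebuild them with the package's own constructor, which I have not shown here.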

To confirm this is an issue in the .dta file, I took a look at it directly in Stata. If you fire up the current release of the 1972-2021 GSS in Stata you'll see the same miscoding there as well. A Stata screenshot to confirm:

[Image: Stata screenshot]

I think you can see this issue in the GSS Data Explorer on the NORC website, as well (e.g. if you try to crosstabulate some of these variables the labels don't work properly).

In summary I think that, for some subset of GSS variables in the current release of the cumulative data file, something has gone wrong during the Data Dictionary process that links labels for various scales to the relevant questions using those scales. I'll try to get in touch with the people at NORC and alert them to the issue.
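To get a rough list of candidate variables, something like the following could be run over the whole data frame. It is a sketch under assumptions: `find_suspect_labels` is a hypothetical helper, the toy data frame just reuses variable names from this thread, and the criterion (a labelled code that never occurs in the data) will also flag legitimate labels that no respondent happened to choose, so the result is a list to check against the codebook, not a verdict.

```r
# Names of columns carrying labelled codes that never occur in the data.
find_suspect_labels <- function(df) {
  suspect <- vapply(df, function(x) {
    labs <- attr(x, "labels")
    labs <- labs[!is.na(labs)]   # skip labels attached to missing codes
    length(labs) > 0 && !all(labs %in% as.vector(x))
  }, logical(1))
  names(df)[suspect]
}

# Toy data frame mimicking the attributes shown earlier (not the real data)
toy <- data.frame(id = 1:4)
toy$violjews <- structure(
  c(1, 4, 7, 5),
  labels = c("1 - not at all severe" = 1, "10 - very severe" = 10)
)
toy$wlthwhts <- structure(
  c(1, 3, 7, 4),
  labels = c("1 hour" = 1, "97 hours or more" = 97)
)
toy$patrwhts <- c(2, 5, 6, 1)    # no labels attribute at all

find_suspect_labels(toy)
#> [1] "violjews" "wlthwhts"
```

Variables with no labels attribute, like patrwhts here, are never flagged, which matches the pattern above where the unlabelled variables look fine.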

kjhealy avatar Aug 23 '22 20:08 kjhealy

Some more documentation of the error just for the sake of it. Here are some sample incorrectly-labeled tabulations from the Berkeley SDA archive of the 2021 GSS Cumulative file. (https://sda.berkeley.edu/sdaweb/analysis/?dataset=gss21)

[Images: two screenshots of SDA tabulations]

kjhealy avatar Aug 24 '22 15:08 kjhealy