Updated WRU - Different Numbers than Old Version
Hello, I updated from WRU 0.1-12 to WRU 1.0.1 so I could update the probabilities in our CA voter files with 2020 census data. When comparing the new probabilities (WRU 1.0.1 and 2020 census data) to our old probabilities (WRU 0.1-12 and 2010 census data), the aggregated numbers are concerningly different. To make sure it wasn't just the new census numbers, I ran the same voter files using WRU 1.0.1 and 2010 census data, and the results were also very different from those produced by WRU 0.1-12 with 2010 census data on the same file. For example, the number of Hispanic voters (calculated by summing probabilities at the county level) was more than 1 million higher with WRU 1.0.1 (2010 census data) than with WRU 0.1-12 (2010 census data).
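For readers who want to reproduce this kind of comparison, here is a minimal sketch of the aggregation described above: summing per-voter probabilities within each county to get an expected count, then differencing two runs. It is written in plain Python for illustration only; the column names (`county`, `pred_his`) and the toy numbers are assumptions, not values from the actual voter files or from wru itself.

```python
from collections import defaultdict


def expected_counts(rows, group_key="county", prob_key="pred_his"):
    """Sum per-voter probabilities within each group (an expected count)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[prob_key]
    return dict(totals)


def compare_runs(old_rows, new_rows, **kwargs):
    """Return new-minus-old expected counts for every group seen in either run."""
    old = expected_counts(old_rows, **kwargs)
    new = expected_counts(new_rows, **kwargs)
    groups = sorted(set(old) | set(new))
    return {g: new.get(g, 0.0) - old.get(g, 0.0) for g in groups}


# Made-up toy data standing in for two runs over the same voter file.
old_run = [
    {"county": "001", "pred_his": 0.25},
    {"county": "001", "pred_his": 0.5},
    {"county": "003", "pred_his": 0.125},
]
new_run = [
    {"county": "001", "pred_his": 0.5},
    {"county": "001", "pred_his": 0.75},
    {"county": "003", "pred_his": 0.125},
]

diff = compare_runs(old_run, new_run)
print(diff)
```

A large positive difference in a county means the newer run assigns more total probability mass to that group there, which is the pattern reported in this issue.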
I also ran the predict_race test from your README file using your sample data set. Using WRU 0.1-12 (2010 census data), I got the exact same probability outputs as your screenshots. When using WRU 1.0.1 (2010 census data), however, the probability outputs were a little different. The test voter file appears to be the same with the exception of voterID 3 surname changing from Valesco to Rivera. See below:
WRU 0.1-12
WRU 1.0.1
Are these differences expected and maybe a change in methodology? I assumed both new and old versions would have similar, if not the same, outputs when using 2010 census data with the same voter file and the same settings. I appreciate any help. Also attaching my code used with my voter files for both WRU versions in case you catch something I missed.
WRU 0.1-12
WRU 1.0.1
[Screenshot: WRU 1 1_ScreenShot_AM CODE]
Also, thank you so much for this package!
We did make a change to the calculation of the probabilities. @etrrosenman @solivella @kosukeimai can comment here.
Hi @ameier88,
On a tangential topic, how did you and your team get the 2020 census data? From what I've been seeing, the sf1 file hasn't come out yet for 2020. I would definitely be interested in other leads/sources/ideas, though.
Also following to see the answer to your original question. My team has had problems with much larger swaths of unmatched names than in prior runs, and I would be interested to see if this is in any way linked to your problem.
@hirsch-sw the race data has been available for a while: https://www.census.gov/programs-surveys/decennial-census/about/rdo/summary-files.html. This is what we use in the package.
Is it available with age and sex, though?
@hirsch-sw wru 1.0.0+ does not yet support any of the covariates (age, sex, party). Evan suggested that this might be the driving factor behind the differences that OP is showing.
With that said, age and gender by 2020 tract are also available from ACS. If you use tidycensus it's group B01001. https://api.census.gov/data/2020/acs/acs5/groups/B01001.json.
@ameier88 having taken a deeper look into this, I am concerned that we may be failing to condition on voter party based on the structure of your query. As Brandon mentioned, there is some difficulty in the use of covariates right now because the Census has not released the age and sex distributions; while we tried to structure the newest version of WRU to account for this appropriately, this may be an edge case.
To diagnose the issue, would it be possible to rerun your query using the old version of WRU but without passing in the "party" parameter? If those predictions look very similar to the ones from the new version and with the party parameter provided, that will be very informative.
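One way to judge whether two sets of predictions "look very similar" is to compare them voter by voter rather than only in aggregate. The sketch below is an illustration under assumptions: the column names (`VoterID`, `pred_whi`, and so on) and the toy rows are hypothetical, and the code does not call wru. It reports the largest absolute change across race probabilities for each voter present in both runs.

```python
# Hypothetical column names for the five predicted race probabilities.
RACE_COLS = ("pred_whi", "pred_bla", "pred_his", "pred_asi", "pred_oth")


def max_abs_diff(old_rows, new_rows, id_key="VoterID", cols=RACE_COLS):
    """For each voter in both runs, return the largest absolute probability change."""
    new_by_id = {row[id_key]: row for row in new_rows}
    result = {}
    for old in old_rows:
        new = new_by_id.get(old[id_key])
        if new is None:
            continue  # voter missing from the new run; skip rather than guess
        result[old[id_key]] = max(abs(new[c] - old[c]) for c in cols)
    return result


# Made-up toy rows; each row's probabilities sum to 1.
old_rows = [
    {"VoterID": 1, "pred_whi": 0.5, "pred_bla": 0.25, "pred_his": 0.125,
     "pred_asi": 0.0625, "pred_oth": 0.0625},
    {"VoterID": 2, "pred_whi": 0.25, "pred_bla": 0.25, "pred_his": 0.25,
     "pred_asi": 0.125, "pred_oth": 0.125},
]
new_rows = [
    {"VoterID": 1, "pred_whi": 0.25, "pred_bla": 0.25, "pred_his": 0.375,
     "pred_asi": 0.0625, "pred_oth": 0.0625},
    {"VoterID": 2, "pred_whi": 0.25, "pred_bla": 0.25, "pred_his": 0.25,
     "pred_asi": 0.125, "pred_oth": 0.125},
]

changes = max_abs_diff(old_rows, new_rows)
print(changes)
```

If nearly every voter shows a change close to zero, the two configurations are effectively producing the same predictions.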
@etrrosenman
Below are the results of rerunning the scripts with old and new wru using 2010 census data for the first 15 California counties of the same voter file. I aggregated by summing probabilities. As you can see, new wru is still very different from old wru with no covariates.
Also attaching my scripts and the first six voters to compare probability outputs (names removed and IDs cut off).
WRU 1.0.1 NO COVARIATES
[Screenshot: WRU 1 0 1 _ScreenShot_SCRIPT_NoCovariates]
[Screenshot: WRU 1 0 1 _ScreenShot_NoCovariates]
WRU 0.1-12 NO COVARIATES
[Screenshot: WRU 0 1-12 _ScreenShot_NoCovariates]
WRU 0.1-12 With Party Covariate
[Screenshot: WRU 0 1-12 _ScreenShot_PartyTrue]
Hi @ameier88! We think this is related to how we are handling imputation of surnames that do not appear in the census dictionary. Would you mind checking whether the numbers you are seeing differ as substantially from those in the previous version of wru when restricting your voter file to records with last names that exist in the 2010 census dictionary? Alternatively, you can set impute.missing to FALSE.
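The restriction suggested here can be sketched as a simple filter that splits a voter file into records whose surname appears in a reference dictionary and records whose surname does not. This is an illustration only: the field name `surname`, the toy names, and the upper-case/strip normalization are assumptions, and wru applies its own surname cleaning that this sketch does not reproduce.

```python
def split_by_dictionary(voters, known_surnames, surname_key="surname"):
    """Split records into (matched, unmatched) by surname dictionary membership."""
    # Assumed normalization: trim whitespace and compare case-insensitively.
    known = {name.strip().upper() for name in known_surnames}
    matched, unmatched = [], []
    for voter in voters:
        name = voter[surname_key].strip().upper()
        (matched if name in known else unmatched).append(voter)
    return matched, unmatched


# Made-up toy data: a tiny "dictionary" and three voter records.
dictionary = ["GARCIA", "SMITH"]
voters = [
    {"id": 1, "surname": "Garcia"},
    {"id": 2, "surname": "Smith "},
    {"id": 3, "surname": "Zzyzx"},
]

matched, unmatched = split_by_dictionary(voters, dictionary)
coverage = len(matched) / len(voters)
print(len(matched), len(unmatched), coverage)
```

Running both package versions on only the matched records isolates differences that do not come from how missing surnames are imputed.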
Hello @solivella! Thank you for the suggestion. I ran wru using a number of different arguments and, as you will see in the attachment, setting impute.missing = FALSE does make a difference, but the numbers using 2010 census data are still not close to the old numbers from WRU 0.1-12.
I'm going to close this one. There have been a large number of adjustments to the tables that were used and I think that we have addressed these issues. Notably, age and sex are now available in census data and now part of the most recent version of wru.