OG-USA
Calibrate number of people age {0-17, 18-64, 65+} per tax unit by s,j
Implementing UBI directly in OG-USA (https://github.com/PSLmodels/OG-USA/issues/626) requires calibrating the number of people per tax unit by s,j, split for each of the age groups that could have different UBI amounts, currently 0-17, 18-64, and 65+. We'll want to calculate the value per s,j and then apply kernel density smoothing.
@prrathi and I calculated unsmoothed values using CPS tax units in this notebook. Next step is to do it with PSID instead.
Seems like we can use psid_data_setup.py for this. Our first try crashed Colab but @prrathi will try it again.
@jdebacker, is psid_lifetime_income.pkl, produced in that script, too big for GitHub?
Or will we have to hold onto the columns listed in https://github.com/PSLmodels/OG-USA-Calibration/issues/6 and aggregate them along the way anyway, requiring modification to psid_data_setup.py?
@MaxGhenis Yes, psid_lifetime_income.pkl is too big for GH (~124 MB).
I haven't run that script on Colab, but it runs fine locally (assuming you have all dependencies installed).
All columns in Issue #6 are already included in the PSID data saved to the repo.
Recapping next steps from a meeting with @prrathi:
- Verify that `head_age`, `spouse_age` and `num_children_under18` are exported from `psid_download.R` (see comments in #6 on why these are the fields needed)
- Verify that `psid_lifetime_income.pkl` also preserves these variables; if not, may need to add to `constant_vars`
- Create a new file, e.g. `household_structure.py`, which (a) calculates `nu18`, `n1864`, and `n65` from these variables for each record in `psid_lifetime_income` (per #6), (b) calculates the average of each of these by `s`,`j`, and (c) applies the `MVKDE` function to smooth these cells (see #25).
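Steps (a) and (b) could look roughly like the sketch below. It is only an illustration under assumptions: the column names `head_age`, `spouse_age` and `num_children_under18` come from this thread, but the `s` and `j` column names, the convention that `spouse_age` is missing when there is no spouse, and the choice to ignore other adults and sample weights are all hypothetical. The `MVKDE` step is not shown.

```python
import pandas as pd


def count_age_groups(df):
    """Add nu18, n1864 and n65 columns (assumes only head, spouse and children are counted)."""
    out = df.copy()
    # Head and spouse ages; comparisons against NaN are False, so a missing
    # spouse is simply not counted in any group.
    adult_ages = out[["head_age", "spouse_age"]]
    out["nu18"] = out["num_children_under18"] + (adult_ages < 18).sum(axis=1)
    out["n1864"] = ((adult_ages >= 18) & (adult_ages <= 64)).sum(axis=1)
    out["n65"] = (adult_ages >= 65).sum(axis=1)
    return out


# Tiny made-up sample; "s" (age) and "j" (lifetime income group) are
# hypothetical column names for illustration.
records = pd.DataFrame(
    {
        "s": [30, 30, 30, 70],
        "j": [0, 0, 1, 1],
        "head_age": [30, 30, 30, 70],
        "spouse_age": [28, float("nan"), 66, 68],
        "num_children_under18": [2, 0, 1, 0],
    }
)

counts = count_age_groups(records)
# Unweighted average of each count within every (s, j) cell.
cell_means = counts.groupby(["s", "j"])[["nu18", "n1864", "n65"]].mean()
print(cell_means)
```

A real version would presumably use the survey weights when averaging, and would need whatever extra fields #6 settles on if other household members are to be counted.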
The KDE functions and the dependent scipy.stats.gaussian_kde require probability data. I think we have two options:
- Smooth with something like LOESS, though I couldn't find a multivariate LOESS smoother in Python
- Apply KDE using an extra dimension of the number of people, e.g. determining cells in `s` by `j` by `nu18` (or `n1864` or `n65`, separately). `scipy.stats.gaussian_kde` accepts multivariate (not just bivariate) data, so this should work, and then we can compute the average in each `s` x `j` cell using the density estimates.
@jdebacker what would you suggest?
Actually @prrathi and I realized that we could use the existing KDE function where we model each sxj's share of total children/adults/seniors in the same way that e.g. the share of total transfers by sxj is modeled. Then we can multiply that by the current number of children/adults/seniors to get the average by sxj.
Yes - that is a good solution!
Some updates:
@prrathi tried the KDE with some PSID data, but it was still noisy because it's the quotient of a smoothed numerator (# kids in bin) and unsmoothed denominator (# families in bin). He's going to try smoothing the denominator too.
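For reference, a minimal sketch of smoothing both the numerator and the denominator with `scipy.stats.gaussian_kde`, using its `weights` argument to turn a density of families into a density of children. Everything here is synthetic and assumed: the data are simulated, a continuous income percentile stands in for the `j` dimension, and this is not the repo's existing KDE function.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic families (purely illustrative, not PSID or CPS data).
rng = np.random.default_rng(0)
n_families = 2000
age = rng.uniform(20, 80, n_families)          # s: age of household head
income_pct = rng.uniform(0, 100, n_families)   # continuous stand-in for j
kids = rng.poisson(np.where(age < 50, 1.5, 0.2))  # children per family

points = np.vstack([age, income_pct])

# Denominator: each (s, j) location's share of all families.
family_kde = gaussian_kde(points)
# Numerator: each (s, j) location's share of all children, obtained by
# weighting every family by its number of children.
kids_kde = gaussian_kde(points, weights=kids)


def smoothed_mean_kids(s, j):
    """Average children per family at (s, j) as a ratio of two smoothed densities."""
    at = np.vstack([np.atleast_1d(s), np.atleast_1d(j)]).astype(float)
    total_kids_here = kids_kde(at) * kids.sum()
    total_families_here = family_kde(at) * n_families
    return total_kids_here / total_families_here


print(smoothed_mean_kids([30, 70], [50, 50]))
```

The two estimators pick their bandwidths independently here, so the ratio can still wobble where families are sparse; sharing a bandwidth between them is one thing that might be worth trying.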
Given the PSID data issues described in #28, we tried returning to the taxdata CPS file in this notebook, and using stratified LOESS. Here's the raw data for 18-64:
To avoid the jumps, @prrathi is going to start with the counts excluding the household head, then add the household head to the appropriate count based on their age post hoc.
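In code, the post hoc step might look like the sketch below. The function name, the bin table, and the input array of smoothed counts are hypothetical; the age cutoffs are the 0-17, 18-64 and 65+ groups from this issue.

```python
import numpy as np

# Age bins from this issue: 0-17, 18-64, 65+ (inclusive bounds).
AGE_BINS = {"nu18": (0, 17), "n1864": (18, 64), "n65": (65, np.inf)}


def add_head_back(smoothed_excl_head, head_ages, group):
    """Add 1 to a smoothed count (computed without the head) where the head falls in `group`."""
    lo, hi = AGE_BINS[group]
    head_ages = np.asarray(head_ages)
    head_in_bin = (head_ages >= lo) & (head_ages <= hi)
    # The boolean mask is added as 0 or 1, so the jump at the bin edge is exact
    # rather than being blurred by the smoother.
    return np.asarray(smoothed_excl_head, dtype=float) + head_in_bin


# Example: smoothed number of *other* 18-64 year olds, for heads aged 64 to 66.
print(add_head_back([0.5, 0.6, 0.4], [64, 65, 66], "n1864"))
```

The point of this design is that the smoother only ever sees a series without the discontinuities at 18 and 65, and the deterministic part is added afterward.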
Here's the LOESS smoother with the 18-64 bin, just for household head ages 18-64 to avoid smoothing that spike:

And the residuals:

We tried several values of frac (essentially the bandwidth; defaults to 0.67) and found that 0.4 avoided large sustained residuals while also avoiding an implausibly large number of inflection points.
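To make "stratified LOESS" concrete, here is a self-contained sketch that smooths over age separately within each income group. It is a hand-rolled local linear fit with tricube weights and no robustness iterations, written only with NumPy and pandas; it is not the notebook's code, and a library smoother would likely give slightly different numbers. The `frac` argument is meant to play the same role as above, and the `s`, `j` and `value` column names and the simulated data are assumptions.

```python
import numpy as np
import pandas as pd


def loess_1d(x, y, frac=0.4):
    """Local linear regression with tricube weights over the nearest frac * n points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(np.ceil(frac * n))))  # neighbours per local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]  # distance to the k-th nearest point
        if h == 0:
            fitted[i] = y[dist == 0].mean()
            continue
        w = np.clip(1.0 - (dist / h) ** 3, 0.0, 1.0) ** 3  # tricube weights
        sw = np.sqrt(w)
        design = np.column_stack([np.ones(n), x - x[i]])
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0]  # intercept is the fitted value at x[i]
    return fitted


def stratified_loess(cells, frac=0.4):
    """Smooth `value` over `s` separately for each `j`, and report residuals."""
    out = cells.sort_values(["j", "s"]).reset_index(drop=True)
    out["smoothed"] = np.nan
    for j_val in out["j"].unique():
        mask = (out["j"] == j_val).to_numpy()
        out.loc[mask, "smoothed"] = loess_1d(
            out.loc[mask, "s"], out.loc[mask, "value"], frac=frac
        )
    out["residual"] = out["value"] - out["smoothed"]
    return out


# Simulated cell averages for head ages 18-64 and two income groups.
ages = np.arange(18, 65)
rng = np.random.default_rng(0)
cells = pd.DataFrame({"s": np.tile(ages, 2), "j": np.repeat([0, 1], len(ages))})
cells["value"] = (
    1.0 + 0.01 * cells["s"] + 0.2 * cells["j"] + rng.normal(0, 0.05, len(cells))
)

result = stratified_loess(cells, frac=0.4)
print(result.groupby("j")["residual"].agg(["mean", "std"]))
```

Because each `j` is smoothed on its own, nothing ties neighbouring income groups together, which is the gap a multivariate smoother would close.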
If the KDE smoothing for the numerator and denominator doesn't work as well, this stratified LOESS seems pretty good (though a multivariate LOESS would be better). @rickecon fyi.