pummeler
Log transform for US$ variables
Here are the variables I think we should log transform, all representing income/wages/etc.
VERSIONS = {
...
'log_transform_feats': '''INTP OIP PAP RETP SEMP SSIP SSP WAGP PERNP
PINCP'''.split(),
Only issue is that some of these variables can be negative (for losses). So I guess the transformation for those should be x = log(x - min(x)) or something?
Once we figure that out it should be easy to put this into get_dummies.
I think in my pre-pummeler attempt at this I did sign(x) * log(x + 1*sign(x)) or something. log(x - min(x)) isn't shaped very nicely if min(x) is, say, -915,729,293.
don't understand... 1+sign(x)?
I just looked through the codebook more carefully. Most (all?) of these are truncated below ("Rounded & bottom-coded"), so I think something like my solution actually makes sense. Sure, it won't be a normal distribution, but if we're featurizing using KDE then it'll just have a weird bump in the lower tail. Of course my solution doesn't work when x = min(x), so I guess now I'm proposing:
log(x - min(x) + 1)
I was a little off before: what I want is sign(x) * log( |x| + 1 ), which maintains both sign information and magnitude information. Doing log(x - min(x) + 1) is weird because it conflates very-negative incomes with slightly-negative incomes, while the amount that moderate incomes are conflated depends on what the min is.
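To make the comparison concrete, here is a minimal NumPy sketch of the two candidate transforms. The function names `signed_log` and `shifted_log` and the sample values are made up for illustration; none of this is existing pummeler code.

```python
import numpy as np

def signed_log(x):
    """sign(x) * log(|x| + 1): keeps the sign, compresses the magnitude, maps 0 to 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

def shifted_log(x):
    """log(x - min(x) + 1): the result depends on the minimum of the sample."""
    x = np.asarray(x, dtype=float)
    return np.log1p(x - x.min())

# Made-up incomes, including a large and a small loss.
incomes = np.array([-10000.0, -100.0, 0.0, 100.0, 50000.0])
print(signed_log(incomes))
print(shifted_log(incomes))
```

With `signed_log`, a loss and a gain of the same size land at mirror-image values around zero, whereas `shifted_log` changes every output whenever the sample minimum changes.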
OK, finally went through case-by-case using the sampled data. Here are the only two monetary variables that I found that can actually be negative:
INTP (Interest, dividends, and net rental income) has a bunch of true zeros ("None"). Only 0.2% were negative.
SEMP (Self-employment income) is the same as INTP, with even more true zeros. Again only 0.2% were negative (correlated with INTP?)
So maybe we just do categorical variables for whether INTP/SEMP are non-zero? But I still don't know what transform to use for positive / negative. Here are our two proposals, neither looks great:

Update: forgot about PERNP, which can also be negative. Or have true zeros (no earnings)?
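For what it's worth, the indicator idea could be sketched roughly like this with pandas. The helper `add_sign_indicators`, the `_nonzero` / `_negative` / `_slog` column suffixes, and the sample rows are all hypothetical, and this is not how pummeler's get_dummies currently works.

```python
import numpy as np
import pandas as pd

def add_sign_indicators(df, cols):
    """Hypothetical helper: add non-zero and negative flags plus a signed-log column."""
    out = df.copy()
    for col in cols:
        out[col + "_nonzero"] = (df[col] != 0).astype(int)
        out[col + "_negative"] = (df[col] < 0).astype(int)
        # sign(x) * log(|x| + 1), as proposed earlier in the thread
        out[col + "_slog"] = np.sign(df[col]) * np.log1p(df[col].abs())
    return out

# Made-up rows; real values would come from the PUMS sample.
sample = pd.DataFrame({"INTP": [0, 250, -40], "SEMP": [0, 0, 12000]})
features = add_sign_indicators(sample, ["INTP", "SEMP"])
print(features)
```

The flags let a downstream model treat true zeros and losses separately, while the signed-log column still carries the magnitude.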
Also, what does this codebook entry mean?
RACNUM = Number of major race groups represented
1..6 .Race groups
IIRC RACNUM is the flag for how many racial groups the person has indicated, with RAC1P the first race, RAC2P the second, etc.