Log transform for US$ variables

Open flaxter opened this issue 9 years ago • 6 comments

Here's the variables I think we should log transform, all representing income/wages/etc.

VERSIONS = {
...
    'log_transform_feats': '''INTP OIP PAP RETP SEMP SSIP SSP WAGP PERNP
                            PINCP'''.split(),

Only issue is that some of these variables can be negative (for losses). So I guess the transformation for those should be x = log(x - min(x)) or something?

Once we figure that out it should be easy to put this into get_dummies.

flaxter avatar Nov 02 '16 21:11 flaxter

I think in my pre-pummeler attempt at this I did sign(x) * log(x + 1*sign(x)) or something. log(x - min(x)) isn't shaped very nicely if min(x) is, say, -915,729,293.

djsutherland avatar Nov 02 '16 21:11 djsutherland

don't understand... 1+sign(x)?

I just looked through the codebook more carefully. Most (all?) of these are truncated below ("Rounded & bottom-coded"), so I think something like my solution actually makes sense. Sure, it won't be a normal distribution, but if we're featurizing using KDE then it'll just have a weird bump in the lower tail. Of course my solution doesn't work when x = min(x), so I guess now I'm proposing:

log(x - min(x) + 1)

flaxter avatar Nov 02 '16 22:11 flaxter

I was a little off before: what I want is sign(x) * log( |x| + 1 ), which maintains both sign information and magnitude information. Doing log(x - min(x) + 1) is weird because it conflates very-negative incomes with slightly-negative incomes, while the amount that moderate incomes are conflated depends on what the min is.
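
To make the comparison concrete, here is a minimal NumPy sketch of the two candidate transforms. The function names `signed_log1p` and `shifted_log1p` and the toy income values are assumptions for illustration only; they are not part of pummeler.

```python
import numpy as np

def signed_log1p(x):
    """sign(x) * log(|x| + 1): keeps the sign and compresses the magnitude."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

def shifted_log1p(x):
    """log(x - min(x) + 1): depends on the minimum of the sample."""
    x = np.asarray(x, dtype=float)
    return np.log1p(x - x.min())

# Toy values (made up), including a loss and a true zero.
incomes = np.array([-10_000.0, -100.0, 0.0, 100.0, 50_000.0])
signed = signed_log1p(incomes)
shifted = shifted_log1p(incomes)

# Add one extreme loss, like the -915,729,293 mentioned earlier in the thread.
with_outlier = np.append(incomes, -915_729_293.0)
shifted_outlier = shifted_log1p(with_outlier)

# Gap between the 100 and 50,000 incomes under each transform.
gap_signed = signed[4] - signed[3]
gap_shifted_outlier = shifted_outlier[4] - shifted_outlier[3]
```

With the extreme minimum included, the shifted transform maps 100 and 50,000 to almost the same value, while the signed transform keeps them well apart and still sends zero to zero.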

djsutherland avatar Nov 02 '16 22:11 djsutherland

OK, finally went through case-by-case using the sampled data. Here are the only two monetary variables that I found that can actually be negative:

  • INTP (Interest, dividends, and net rental income) has a bunch of true zeros ("None"). Only 0.2% were negative.
  • SEMP (Self-employment income) is the same as INTP, with even more true zeros. Again only 0.2% were negative (correlated with INTP?)

So maybe we just do categorical variables for whether INTP/SEMP are non-zero? But I still don't know what transform to use for positive / negative. Here are our two proposals, neither looks great:

[Plots: semp, intp]
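
One way to combine the two ideas, sketched under assumptions: a 0/1 indicator for "non-zero" next to a signed log of the value. The helper `income_features`, the output key names, and the toy values are hypothetical, and the sketch assumes missing values have already been handled (a NaN would compare as non-zero here).

```python
import numpy as np

def income_features(columns):
    """Build an indicator and a signed-log feature for each named column.

    `columns` maps a variable name (e.g. "INTP") to a 1-D array of values.
    """
    out = {}
    for name, values in columns.items():
        x = np.asarray(values, dtype=float)
        # 1 where the person reported any amount, 0 for a true zero.
        out[name + "_nonzero"] = (x != 0).astype(int)
        # sign(x) * log(|x| + 1), so zeros stay at exactly 0.
        out[name + "_slog"] = np.sign(x) * np.log1p(np.abs(x))
    return out

# Toy data (made up): mostly zeros, one loss in each column.
toy = {
    "INTP": [0, 0, 250, -40],
    "SEMP": [0, 12_000, 0, -3_000],
}
features = income_features(toy)
```

Keeping the indicator separate lets a downstream model treat the large mass of true zeros differently from small positive or negative amounts.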

flaxter avatar Nov 03 '16 08:11 flaxter

Update: forgot about PERNP, which can also be negative. Or have true zeros (no earnings)?

Also, what does RACNUM ("Number of major race groups represented", 1..6) mean? What are "race groups" here?

flaxter avatar Nov 03 '16 09:11 flaxter

IIRC RACNUM is the flag for how many racial groups the person has indicated, with RAC1P the first race, RAC2P the second, etc.

djsutherland avatar Nov 03 '16 09:11 djsutherland