Imputation of e02000, e26270, p23250, p22250 from PUF to CPS
Hi, I am creating this as a general-purpose issue for this imputation task. As a preliminary step, here is my understanding of our proposed methodology for this task (using P22250 as a test case).
- Create a model to predict the sign of P22250 in cps (this has been done)
- For observations in cps predicted to have a zero, impute a zero.
- For observations in cps predicted to have negative or positive P22250 values, create separate models trained on the negative and positive puf P22250 data, then use those models to impute continuous values to their respective cps observations.
I am comparing models for step 3 of this task. Taking the advice of @martinholmer, given the long tails of the response variable, I am testing an OLS regression on log(P22250) (log(-P22250) for the negative data). I will also be using the same 80/20 training and testing split method used to compare models in step 1. That is, for this portion of the task I will be working with 4 subsamples of the puf dataset, created by splitting the data into negative/positive and then splitting each subsample into training/testing. I will train the negative P22250 model on the negative training subsample and test it on the negative testing subsample, and work analogously with the positive model.
Finally, I plan on scoring the models' predictions for the testing datasets using MSE on the suggestion of @MaxGhenis, but I am of course open to other suggestions.
I'd recommend splitting into training/testing first, then splitting based on predicting sign. This will more easily enable you to compare performance between this branched model and the random forests.
MSE is a reasonable first step, but since we care about the distribution and the accuracy of the sign, I recommend the quantile loss function. I shared more detail on this at https://github.com/open-source-economics/taxdata/issues/221#issuecomment-401876770.
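For concreteness, here is a minimal sketch of the pinball (quantile) loss for a single quantile q (the function name is hypothetical):

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    # Pinball loss: under-predictions are weighted by q, over-predictions by (1 - q).
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))
```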
@MaxGhenis OK thanks, I will implement that. I am going to post an initial question regarding my OLS model soon as well.
The first issue I've run into using an OLS model that regresses on log(P22250) is that the model has so far produced absurd predictions for P22250 (in the billions), and I wanted to ask for input on my code regarding that issue.
You can see my code at the bottom of this comment (under "Click to expand"). The predictors I use in that code are those that I thought would be good predictors and had p-values < 0.10. I also investigated more predictor groups in this notebook: OLS overestimation with various predictor groups. There I found that including any of the continuous predictors is what creates the massive overestimations, but lacking those predictors the model greatly underestimates P22250. Also, the problem persists (often getting worse) after altering the seed value to create new training/testing samples.
Note: pos_train and neg_train are the training datasets sampled from the positive P22250 and negative P22250 puf subsamples. pos_test and neg_test are the positive and negative testing datasets from the corresponding subsamples.
Some ideas as to what's causing the problem:
- The problem might be collinearity between our predictors, as noted in the warning about the model's large condition number (1.14e+07). However, an analysis of the predictors' correlation coefficients revealed that most are below 0.45, even with their most-correlated co-predictor, and the predictors I use in this model are all below that line, as can be seen at the bottom of this notebook under "Check general correlations": puf correlations analysis. Perhaps the presence of some correlation among all of the predictors is still too great an issue.
- I also considered that the divide-by-zero error I'm getting from the code that creates the log_P22250 column might be causing the issue. However, after working around that error by using a loop to create the log_P22250 column (rather than using np.where()), I found the model produces the same results, even though the error no longer manifests.
- Perhaps the issue is that I am doing the second log transformation incorrectly, that is, the transformation of the model's predictions from log(P22250) back to a real P22250 value. The correct method seems simple - take the predicted log values and transform them using np.exp(), as I do in this code:
log_predictions_pos = fit_pos.predict(pos_test[pred_pos])
predictions_pos = np.exp(log_predictions_pos)
where log_predictions_pos are the model's predictions for log(P22250) for the positive testing
data, and predictions_pos should be the correctly re-transformed predictions - but
researching this specific question was surprisingly fruitless, so I am not sure if this is the correct method.
- Lastly, I tested an OLS model in which I also log-transformed the predictors themselves, based on some research which suggested this might be useful - especially if such a transformation makes those predictors normally distributed. This has had some promising results which I will soon include in another comment, unless the responses to this comment indicate a different path entirely.
Click to expand code
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import metrics
from sklearn.model_selection import train_test_split
#Remove aggregate rows, replace NaN with 0
puf = pd.read_csv('puf2011.csv')
puf = puf[(puf['RECID'] != 999996) &
(puf['RECID'] != 999997) &
(puf['RECID'] != 999998) &
(puf['RECID'] != 999999)
]
puf = puf.fillna(0)
#MARS to k-1 dummies
puf[['MARS2', 'MARS3', 'MARS4']] = pd.get_dummies(puf['MARS'], drop_first = True)
#These vars are combined in cps
puf['E19800_E20100'] = puf['E19800'] + puf['E20100']
#All vars shared between puf and cps except E00650 (colinear w/E00600) and E01100 (crashing mnlogit)
potential_pred = [
'DSI', 'EIC', 'MARS2', 'MARS3', 'MARS4', 'XTOT', 'E00200',
'E00300', 'E00400', 'E00600', 'E00800', 'E00900', 'E01400',
'E01500', 'E01700', 'E02100', 'E02300', 'E02400', 'E03150',
'E03210', 'E03240', 'E03270', 'E03300', 'E17500', 'E18400',
'E18500', 'E19200', 'E19800_E20100','E20400', 'E32800',
'F2441', 'N24'
]
#Log of response
puf['log_P22250'] = np.where(puf['P22250'] == 0, 0, np.sign(puf['P22250'])*np.log(abs(puf['P22250'])))
keep = ['P22250', 'AGIR1', 'log_P22250'] + potential_pred
puf = puf[keep]
np.random.seed(100)
train, test = train_test_split(puf, test_size=0.2)
#Subsamples of train/test where P22250 > 0, or < 0 pos or neg for 2nd stage imputation
pos_train = train[train['P22250'] > 0].copy()
neg_train = train[train['P22250'] < 0].copy()
pos_test = test[test['P22250'] > 0].copy()
neg_test = test[test['P22250'] < 0].copy()
pred_pos = [
'DSI', 'EIC', 'MARS2', 'MARS3', 'MARS4', 'XTOT', 'E00200',
'E00300', 'E00400', 'E00600', 'E03240',
'E18400', 'E18500'
]
endog_pos = pos_train['log_P22250']
exog_pos = pos_train[pred_pos]
fit_pos = sm.OLS(endog_pos, exog_pos).fit()
summary_pos = fit_pos.summary()
#Predictions of log(P22250), transforming back to P22250
log_predictions_pos = fit_pos.predict(pos_test[pred_pos])
predictions_pos = np.exp(log_predictions_pos)
max_predictions_pos = round(predictions_pos.max(), 2)
actual_pos = pos_test['P22250']
MSE_pos = round(metrics.mean_squared_error(actual_pos, predictions_pos), 2)
print(
'Largest predicted P22250: $' + '{:,}'.format(max_predictions_pos), '\n',
'MSE is: ' + str(MSE_pos), '\n',
summary_pos
)
Based on further reading, it seems that to justify log-transforming our predictors I should be looking at how that transformation changes the model's residuals - either making them symmetrical or ensuring they don't change systematically with the value of the response variable. I will follow up soon with this residual analysis of the above model and of the log-transformed-predictor model.
In the meantime I have two methodological questions:
- As the 2nd stage models are intended to predict the non-zero data alone, they will be trained and tested on relatively small subsamples of puf. Specifically, pos_train and pos_test have ~15,800 and 3,907 observations respectively, and neg_train and neg_test have 18,600 and 4,744 observations respectively. Is this small enough to motivate evaluating the models not on their performance on one training/testing subsample, but on their average performance across multiple, so that more puf data have the chance to be training/testing data?
- E02300 (unemployment insurance benefits) and E03210 (student loan interest deduction) seem like potentially useful signals, but only to indicate if someone is unemployed or if someone is still paying off student loans. I would imagine that those who are either unemployed or paying off student loans would tend to have less income from capital gains or corporations/partnerships. That being said, do we really gain more information by knowing if they have larger or smaller student loans or unemployment benefits? I don't think we do, except insofar as having greater unemployment benefits indicates that one's wages are higher, but that data should be captured in E00200. I would propose making these variables into dummy variables: 0 if 0, and 1 if > 0 (both are only 0 or positive); a quick sketch follows below.
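For example (a minimal sketch, assuming the puf DataFrame from the code above; the _ind column names are hypothetical):

```python
# Indicator versions of E02300 and E03210: 1 if the amount is positive, 0 otherwise.
puf['E02300_ind'] = (puf['E02300'] > 0).astype(int)
puf['E03210_ind'] = (puf['E03210'] > 0).astype(int)
```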
Looking at the residuals it doesn't appear that log-transforming the predictors is having much of a beneficial effect, despite the predictions being somewhat more realistic.
Regressing on log(P22250) without log-transforming predictors
The largest predictions are so outrageous that most of the residual plot is incomprehensible.

Removing all predictions greater than $100,000,000 (8 in total) tells a similar story: most large residuals are concentrated in observations with low P22250 values.

And removing all predictions larger than $10,000,000 (14 in total) again shows that the smaller P22250 values are generally overestimated.

Regressing on log(P22250) having log-transformed the continuous predictors.
Here we are using the same predictors; the only difference in the code is that I log-transformed the continuous predictors. The residuals, while much smaller, show the same overall pattern - lots of overestimation clustered around the smaller actual P22250 values.

I don't think this supports log-transforming the predictors, and unless there's something wrong in how I am using the OLS model or the log transformations, I don't think this analysis supports using an OLS model to predict P22250's value at all.
@Abraham-Leventhal asked:
- As the 2nd stage models are intended to predict the non-zero data alone, they will be trained and tested on relatively small subsamples of puf. Specifically, pos_train and pos_test have ~15,800 and 3,907 observations respectively, and neg_train and neg_test have 18,600 and 4,744 observations respectively. Is this small enough to motivate evaluating the models not on their performance on one training/testing subsample, but on their average performance across multiple, so that more puf data have the chance to be training/testing data?
By "average performance across multple", do you mean Cross-Validation?
If so, I think this is plenty of data to do that. One way to do CV is:
- Choose how many folds you want. It is common to use K=10, but this is somewhat arbitrary.
- For each fold K=k, train your model on the remaining samples.*
- Evaluate predictions on the held-out data (fold k). Save the MSE (or whichever metric you are using).
- Choose the model with the best average metric score across the different CV iterations.
* Some sources recommend that you do predictor selection in this step. If you do predictor selection outside of this step, then theoretically, your model is biased because you evaluated your predictors on the basis of the entire sample. So some "knowledge" of the fold of samples that is excluded in this step is already baked into the process. However, this particular problem requires substantial human evaluation to select the predictors. So, I think it's unreasonable to do so in step (2) for this particular problem. For reference, see section 7.10.2 in The Elements of Statistical Learning.
However, my only experience with this is from an ML class I took a couple years ago. @MaxGhenis does this jive with your understanding of CV?
@hdoupe
By "average performance across multiple", do you mean Cross-Validation?
Cross-validation requires that you run K training/testing iterations for each model you are evaluating, such that for each model, every observation is used once and only once in a testing dataset across the iterations. Maybe we should do that, but I was leaving open the possibility of doing multiple training/testing iterations without satisfying the strict requirements of a K-fold cross validation - perhaps by just altering the seed value before re-partitioning puf, and running the tests again. Then doing that 3 or 4 times. Perhaps puf is large enough that this would work. Or maybe the K-fold method is a better idea.
- Some sources recommend that you do predictor selection in this step...
I don't understand what you're saying here. I do see one way that having to choose both predictors and models adds a layer of complexity (for instance, if some models perform better on some predictor sets, aren't we implicitly favoring one model by choosing our predictors for the cross-validation based on their performance in that model in a preliminary prediction task?), but I think my knowledge of model bias is a little too shallow to grasp the rest of your point.
I don't think the requirements are too strict. Also, I'm not so sure about choosing an ad hoc method over it. The steps seem straightforward:
- shuffle data
- divide data set into K partitions (0 through N/k, N/k + 1 through 2*N/k, 2*N/k + 1 through 3*N/k, ...)
- train on all but the k-th data partition; evaluate on the k-th partition
- average the results
- select model with best average score
Depending on how your model is set up, you may be able to use sklearn's CV functionality.
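For example, a minimal sketch using sklearn's cross_val_score (assuming the pos_train / pred_pos objects from the code above, and sklearn's LinearRegression in place of statsmodels OLS):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 10-fold CV, scoring each fold with negative MSE, then averaging.
ols = LinearRegression()
scores = cross_val_score(ols, pos_train[pred_pos], pos_train['log_P22250'],
                         cv=10, scoring='neg_mean_squared_error')
print('Average CV MSE:', -scores.mean())
```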
Does that make more sense?
RE the second part: I guess the whole idea is to build a model so that it performs well even on data that it hasn't seen yet. You simulate that by removing a partition of your data set and evaluating your model on it. But, if you choose your predictors based off of the entire data set, including the partitioned out part, then it's like your model has had a "glimpse" of the partitioned out data. This biases your test error rate down (makes your model appear slightly better than it actually is), since the model already has some information on that hold out partition of the data via the predictor selection process.
In cross-validation, the idea is to repeat this train/test process K times and average the results in hopes of mitigating the effects of fortunate or unfortunate train/test splits. So, if you have already selected your predictors beforehand, then your model has some information on the holdout fold, even though it appears to only be trained on the other K-1 folds.
However, if the predictor selection process is resource intensive (i.e. you examining lots of correlations between predictors and transformations of predictors), it's probably not worth going through the trouble of screening predictors for each iteration of cross-validation.
Does that make any more sense?
@hdoupe
Sure, the method makes sense and if you think K-fold cross validation would be best I would go along with that.
But, if you choose your predictors based off of the entire data set, including the partitioned out part, then it's like your model has had a "glimpse" of the partitioned out data.
Ah this makes a lot of sense.
However, if the predictor selection process is resource intensive (i.e. you examining lots of correlations between predictors and transformations of predictors), it's probably not worth going through the trouble of screening predictors for each iteration of cross-validation.
This is probably true as well. Of course, I'd have to get past these issues with OLS before getting to that point.
Given the distribution of the non-zero P22250 data (and this is true for the other imputation variables as well), I was thinking that an exponential model might be of use - what do you think? Here is a screenshot of the distribution of 95% of the P22250 data that I made earlier in my internship:

These tails continue left and right quite far, and we have two major negative outliers with P22250 values of about -124,000,000 and -60,000,000 with the next-lowest being about -30,000,000 at which point the gaps between data points shrink quite a bit.
CV is helpful for variable selection and other tuning parameters; for example, random forests do something like CV as part of the algorithm, and there are prebuilt CV methods for hyperparameter tuning in models like the Lasso (see LassoCV in sklearn). In general, sklearn is built much more for predictive modeling than statsmodels, so it has more resampling and evaluation techniques built in. Since you're using standard OLS, though, there's no hyperparameter tuning going on, so I don't think CV is necessary (you should expect better predictions from regularized models like lasso or ridge).
That said, I'd still recommend assessing the models on a holdout set for a final comparison. This lets you experiment with the training set in multiple ways (CV, bootstrap, etc.) while having a clear way to make the final judgment. Modeling competitions like those from Kaggle operate in this way.
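As a sketch of the LassoCV idea (assuming the pos_train / pos_test / pred_pos objects from the earlier code):

```python
from sklearn.linear_model import LassoCV

# LassoCV picks the regularization strength by internal cross-validation on the training data.
lasso = LassoCV(cv=5, random_state=0)
lasso.fit(pos_train[pred_pos], pos_train['P22250'])
lasso_pred = lasso.predict(pos_test[pred_pos])
```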
@MaxGhenis Thanks for the suggestions, I'm working on comparing LassoCV and your earlier random forests model for this stage of the project (continuous predictions).
Question: As both lasso and random forests contain prebuilt CV (or CV-like) methods would it be valid to train them on the entire (positive or negative portions of the) puf data set then test them on the (positive or negative portions of the) holdout test dataset? If the models are doing their own CV does that mitigate the danger of allowing them to get a glimpse of the testing data during the training stage?
The other option would be to do what I am doing with the OLS models: train the positive-data models on the positive portion of the training subset (pos_train), test those models on the positive portion of the testing subset (pos_test), and so on.
Thanks again for your feedback
Assuming that we'd go with the latter option (training both models only on the training dataset), I tried out both random forests and LassoCV in this notebook and it appears LassoCV outperforms random forests (especially for the negative data) though the error is quite large for both: rf / LassoCV notebook
I'm transforming MARS to k-1 dummy variables because it doesn't appear that scikit-learn will do so automatically if MARS is a string, and based on my research, one is supposed to use only k-1 dummies, where k = # of categories. Please correct me if I am wrong in this regard.
You're comparing two models:
- Random forests (just a single model)
- Trinomial logit + two linear models for positive and negative logit predictions
In each case you're evaluating on the test set, like this:

If you want to segment your evaluation into positive/negative/zero, I'd recommend doing so on consistent segments of the testing data, e.g. by actual positive/negative/zero. In the most straightforward case you're comparing a record for which both model 1 and model 2 correctly predict the sign, and comparing the error. There will also be cases where the models disagree with each other or with the actual sign, so segmenting on true sign will be fairest.
I'm transforming MARS to k-1 dummy variables because it doesn't appear that scikit-learn will do so automatically if MARS is a string, and based on my research, one is supposed to use only k-1 dummies, where k = # of categories. Please correct me if I am wrong in this regard.
Right, sklearn doesn't automatically transform as statsmodels does. pd.get_dummies is helpful. Using k-1 dummies avoids collinearity.
OK, I had thought that since mnlogit produced better categorical predictions for P22250 than random forests (using log loss as our metric) we'd only consider mnlogit as a sign-prediction model (unless we tested more categorical prediction models). Thus, if random forest were to be used at all, it would be only for the second stage of the task - predicting P22250's continuous value for those observations predicted by mnlogit to be non-zero.
It seems that you're suggesting that random forests should only be considered as a model if it is used for both stages, is that correct?
In the most straightforward case you're comparing a record for which both model 1 and model 2 correctly predict the sign, and comparing the error. There will also be cases where the models disagree with each other or with the actual sign, so segmenting on true sign will be fairest.
This is an interesting issue that I had not considered. What do you mean by "segmenting on true sign will be fairest?" Is that to say that we should only compare MSE's for those observations where both methods get the sign correct?
Thanks
Logit + 2 RFs could be a third model, but RF alone is worth testing and I'd personally start with a single RF vs. logit+LM. The single RF will perform better with more data.
The most important thing IMO is deciding how to decide between models. Whatever the metric is, it should apply to the entire test set. If it's an average of quantile loss functions, it's possible that the single RF would outperform the logit+2RFs, even if the single RF is worse at the -/0/+ classification; this could also be true of MSE.
My "segmenting on true sign will be fairest" regarded your comparison of MSE between RFs based on predicted sign. I was suggesting that if you're interested in MSE on the test set, you can also look at particular slices of the test set to see where models do better/worse, but a set of metrics without a predefined weighting function won't give a clear answer on which model to choose.
@MaxGhenis
Logit + 2 RFs could be a third model, but RF alone is worth testing and I'd personally start with a single RF vs. logit+LM. The single RF will perform better with more data.
This makes sense.
Here is an idea! If we are to try logit + RF, instead of training two RF's on the pos/neg data, train just one RF. Then run the logit model and impute 0 for all observations predicted to be 0. Then, for the logit's negative predictions, impute the mean of the negative RF estimators for that observation (or a random RF estimator). For the positive predictions, impute the mean of the positive RF estimators for that observation. This would probably lower the MSE but would avoid the "zero-fuzzing" that occurs when using the average of the entire row for all observations.
This would probably lower the MSE but would avoid the "zero-fuzzing" that occurs when using the average of the entire row for all observations.
You should select a random tree's prediction from the random forest, not the average prediction. This makes the "stochastic" part of stochastic imputation built-in and avoids zero-fuzzing.
For a fair comparison, you should compare that against a random draw from the prediction interval of the linear models. This simulates how the real data would be imputed.
I'd hold off on adding complexity to the RF model until there's a clear case that random forest by itself is sufficiently problematic. As you mentioned in an email, log-loss isn't actually much worse when selecting different seeds, and adding complexity adds room for overfitting (and just modeling errors).
Deep quantile regression is another model which, on a theoretical basis, should produce the best predictions for this use case.
You should select a random tree's prediction from the random forest, not the average prediction. This makes the "stochastic" part of stochastic imputation built-in and avoids zero-fuzzing.
For a fair comparison, you should compare that against a random draw from the prediction interval of the linear models. This simulates how the real data would be imputed.
I am still a bit unclear on the motivation behind inserting randomness, which underlies some of my incorrect assumptions going into this process (directly imputing predicted categories in stage 1 and doing the same for continuous values in stage 2, rather than using a stochastic process).
Martin mentioned in the prior issue that it would make the distribution of data more realistic, which makes sense. However, you've also mentioned that it would help deal with overfitting, which is not so clear to me.
Either way, in the short term, I will go about the following:
- Change my linear models' code to draw from their prediction intervals
- This will require research, and if you have recommended functions to perform this that would be much appreciated.
- Check out deep quantile regression
- Investigate quantile loss as a model metric
- Again, this will also require research. I still have the links from your earlier comment, but they only address the theory, so I will look up worked examples in Python; if you have leads on that I would appreciate it.
Thanks again, Max!
I am still a bit unclear on the motivation behind inserting randomness, which underlies some of my incorrect assumptions going into this process (directly imputing predicted categories in stage 1 and doing the same for continuous values in stage 2, rather than using a stochastic process).
Stage 1 also has randomness: rather than selecting the class with the highest probability, you've been selecting a random number and selecting based on how that number falls with respect to the probabilities of each predicted class. Selecting from the CDF in stage 2 is the same idea, just from the continuous angle.
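A minimal sketch of that stage-1 draw (the function name and the column ordering of the probabilities are assumptions):

```python
import numpy as np

def draw_class(probs, rng=None):
    """probs: (n, 3) array of class probabilities, columns assumed ordered ['neg', 'zer', 'pos']."""
    rng = np.random.default_rng() if rng is None else rng
    cdf = probs.cumsum(axis=1)            # cumulative probability per record
    u = rng.random((probs.shape[0], 1))   # one uniform draw per record
    return (u > cdf).sum(axis=1)          # 0 = 'neg', 1 = 'zer', 2 = 'pos'
```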
The advantage of randomness is getting the appropriate spread of predictions. Without randomness, you'd predict something like the mean (point estimate), which would be relatively close to zero for all observations. While that's best for each individual estimate, it produces a set of predictions that's too narrow: for example, if you had a reform that only taxed high values of P22250, or had progressive brackets of it, you'd underestimate the impact if not also pushing some tax units toward the high range of their prediction interval.
However, you've also mentioned that it would help deal with overfitting, which is not so clear to me.
Did I imply that somewhere? This was not my intent. It's just about realistic data.
Unfortunately I may have led you astray with sklearn, for which I'm having no luck finding prediction intervals. They've explicitly said it's out of scope. So you may need to return to statsmodels after all, which has a get_prediction function that includes prediction intervals.
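A rough sketch of drawing from the OLS prediction interval with get_prediction (assuming the fit_pos / pos_test / pred_pos objects from the earlier code; the draws are on the log scale and would still need to be exponentiated):

```python
import numpy as np

pred = fit_pos.get_prediction(pos_test[pred_pos])
frame = pred.summary_frame(alpha=0.05)        # columns include mean, obs_ci_lower, obs_ci_upper
se_obs = (frame['obs_ci_upper'] - frame['obs_ci_lower']) / (2 * 1.96)  # approximate per-record SE
log_draws = frame['mean'] + np.random.normal(size=len(frame)) * se_obs
draws = np.exp(log_draws)                     # back to the P22250 scale (positive model)
```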
That said, if the OLS point estimate is significantly less accurate than the RF, it's unlikely that the quantiles would do better (if anything I'd expect the RF quantiles to be better than OLS, even if they had the same point estimate MSE). So you might be able to save yourself that work.
Here's a sample implementation of the quantile loss function.
Here's an example using sklearn for prediction intervals. It's not clear whether it works for regularized models like Lasso.
Stage 1 also has randomness
Right, I was at least aware that I was inserting randomness into stage 1, using a runif + CDF to impute P22250's sign, but I've only now realized that this applies to both stages, and why: it's about getting a realistic distribution.
That said, if the OLS point estimate is significantly less accurate than the RF, it's unlikely that the quantiles would do better (if anything I'd expect the RF quantiles to be better than OLS, even if they had the same point estimate MSE).
Yes, so far my non-stochastic imputations of P22250's continuous value using OLS and LassoCV (regressing on P22250, or log(P22250), or log(P22250) + log(continuous predictors)) have been orders of magnitude worse than RF in MSE. It's impressive that RF is able to do that when it's trained on data that are (in P22250's case) 65% zero - without simply predicting almost entirely near-zero values. Check out RF's predictions of P22250 over AGI vs. the actual puf data (this is for the testing dataset); it's quite good, though I should mention I cut out the 3 most-negative values in the real testing data, which RF missed entirely (with around 20, 35, and 125 million in losses respectively):

Also thanks for the quantile loss / prediction interval links
@Abraham-Leventhal, I see that you're having problems in your regression with log(gain) for the p22250 variable. Most of the commentary here in issue #267 is impossible to follow because the comments rarely show what you're actually doing. But in this comment you do provide a link to a notebook where you're estimating a log(gain) regression and that notebook looks, in part, like this:
Your selection of explanatory variables (the right-hand-side variables in the regression) is unconventional.
First, it is conventional to include a constant term in the regression. In fact, it is so conventional that the very first thing the statsmodels documentation shows you is how to add a constant term to the list of regression variables. So at this Linear Regression documentation page we have the following example:
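Roughly (a sketch of that pattern, not the exact documentation snippet, using the pos_train / pred_pos names from your earlier code):

```python
import statsmodels.api as sm

exog = sm.add_constant(pos_train[pred_pos])   # prepend a 'const' column of ones
fit_pos_const = sm.OLS(pos_train['log_P22250'], exog).fit()
```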
I would be interested in seeing what results you get when you include a constant term in your regression.
Second, the EIC variable is categorical (like the MARS variable), so you should convert EIC into a set of dummy variables (omitting the first category) just like you did with the MARS variable.
Second, the EIC variable is categorical (like the MARS variable), so you should convert EIC into a set of dummy variables (omitting the first category) just like you did with the MARS variable.
See @Abraham-Leventhal's https://github.com/open-source-economics/taxdata/issues/244#issuecomment-404278599 on EIC being continuous.
@martinholmer My mistake re: the constant term, thanks for catching that! I will get to that when I am back on my work laptop on Monday and post my results and my code.
Most of the commentary here in issue #267 is impossible to follow because the comments rarely show what you're actually doing.
In the two comments where I discuss scripts directly (OLS comment - see raw code at bottom of comment, which you linked, and this one comparing RF and LassoCV) I provide my code. Otherwise I didn't provide code where we discussed general methodology/stats or where I just wanted to show graphs about which I did not have specific questions. Let me know if you would like to see the code behind anything else I have posted here.
Second, the EIC variable is categorical (like the MARS variable), so you should convert EIC into a set of dummy variables (omitting the first category) just like you did with the MARS variable.
It doesn't appear that EIC is categorical, based on the PUF booklet:

Here is a notebook comparing random forests to OLS models using RMSE. My code is also included at the bottom of this comment. I test out OLS models regressing on log(P22250) both without transforming the predictors and with log-transforming the continuous predictors. I made sure to include a constant term in the OLS models, and to distinguish between RF's overall RMSE and its RMSE in predicting pos/neg data - doing so reveals that OLS outperforms RF (in the positive/negative data) using RMSE as a metric when regressing on log(P22250) and using log-transformed continuous predictors.
I will follow up with comparisons using quantile loss, and other analyses of these models such as the distribution of their predictions over AGI which @andersonfrailey thought would be an important metric.
You also might have noticed that RF's log-loss in this notebook is better than the older number we were getting in this notebook. This is important because it was that score that we used to determine that mnlogit outperformed RF as a categorical predictor, and thus that mnlogit should be considered for that task. It turns out that the method of calculating RF's log-loss used here returns different results depending on the random seed used to create the train/test partition, and RF thus sometimes outperforms mnlogit in log-loss. In some trials I conducted I found that mnlogit's log-loss varies little between random seeds, hovering around 0.58, while RF's varies between 0.57 and 0.71. I will provide the code demonstrating this in a future comment, or in my internship-end summary that I give to Anderson. You can also see it by changing the seed value in Cell [2].
Click to expand full code (without some of the summary data included in the notebook)
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels import regression
from sklearn import ensemble
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import train_test_split
# Data
# Remove aggregate rows, replace NaN with 0
puf = pd.read_csv('puf2011.csv')
puf = puf[(puf['RECID'] != 999996) &
(puf['RECID'] != 999997) &
(puf['RECID'] != 999998) &
(puf['RECID'] != 999999)
]
puf = puf.fillna(0)
# Constant column
puf['constant'] = 1
# MARS to K - 1 dummies
puf[['MARS2', 'MARS3', 'MARS4']] = pd.get_dummies(puf['MARS'], drop_first = True)
# E19800 and E20100 combined in CPS
puf['E19800_E20100'] = puf['E19800'] + puf['E20100']
# Categorical dependent variable for 1st stage
puf['sign'] = np.where(puf['P22250'] == 0, 'zer', np.where(puf['P22250'] > 0, 'pos', 'neg'))
# Log response column. When x < 0, result is -log(-x)
puf['log_P22250'] = np.where(puf['P22250'] == 0, 0, np.sign(puf['P22250'])*np.log(abs(puf['P22250'])))
# All variables shared between puf and cps, except for E00650 (colinear w/E00600)
predictors = [
'DSI', 'EIC', 'MARS2', 'MARS3', 'MARS4', 'XTOT',
'E00200', 'E00300', 'E00400', 'E00600', 'E00800', 'E00900',
'E01400', 'E01500', 'E01700', 'E02100', 'E02300', 'E02400',
'E03150', 'E03210', 'E03240', 'E03270', 'E03300', 'E17500',
'E18400', 'E18500', 'E19200', 'E19800_E20100','E20400',
'E32800', 'F2441', 'N24', 'E01100'
]
# Log columns for continuous predictors. When x < 0, result is -log(-x)
discretes = ['DSI', 'EIC', 'MARS2', 'MARS3', 'MARS4', 'XTOT', 'F2441', 'N24']
logs = []
for i in predictors:
    if i not in discretes:
        puf['log_' + i] = np.where(puf[i] == 0, 0, np.sign(puf[i])*np.log(abs(puf[i])))
        logs.append('log_' + i)
keep = ['RECID', 'AGIR1', 'sign', 'P22250', 'log_P22250', 'constant'] + predictors + logs
puf = puf[keep]
np.random.seed(100)
train, test = train_test_split(puf.copy(), test_size=0.2)
# Sub-df's where P22250 != 0, and is pos or neg
pos_train = train[train['P22250'] > 0].copy()
neg_train = train[train['P22250'] < 0].copy()
pos_test = test[test['P22250'] > 0].copy()
neg_test = test[test['P22250'] < 0].copy()
# Random Forests
# 1-stage prediction
# 100 estimators
N_ESTIMATORS = 100
rf = ensemble.RandomForestRegressor(n_estimators=N_ESTIMATORS,
min_samples_leaf=1, random_state=3,
verbose=True,
n_jobs=-1) # Use maximum number of cores.
rf.fit(train[predictors], train['P22250'])
# rf_preds = array of estimators
rf_preds = []
for estimator in rf.estimators_:
    rf_preds.append(estimator.predict(test[predictors]))
rf_preds = np.array(rf_preds).transpose() # One row per record.
# Validation
# Log-loss
# We can calculate the RF model's predicted probability for each sign (and thus its log-loss) using the % of estimators predicting that sign for each observation.
# Note: `metrics.log_loss()` assumes that the order of the columns of predicted probabilities correspond to their categories' alphabetical order. Our categories are `'neg'`, `'zer'` and `'pos'`, which have the alphabetical order of `'neg'`, `'pos'`, `'zer'`, thus they appear in that order in `rf_pred_proba` as `[preds_neg, preds_pos, preds_zer]`
preds_neg = np.sum(rf_preds < 0, axis=1) / 100
preds_zer = np.sum(rf_preds == 0, axis=1) / 100
preds_pos = np.sum(rf_preds > 0, axis=1) / 100
rf_pred_proba = list(map(list, zip(*[preds_neg, preds_pos, preds_zer])))
metrics.log_loss(test['sign'], rf_pred_proba)
# Continuous prediction
# Random estimator selected so that imputation is stochastic
rand_col = np.random.randint(N_ESTIMATORS, size=rf_preds.shape[0])
random_tree = rf_preds[np.arange(rf_preds.shape[0]), rand_col]
pred_random_tree = pd.DataFrame({'actual': test['P22250'],
'actual_sign': test['sign'],
'pred': random_tree})
pred_random_tree['error'] = pred_random_tree.pred - pred_random_tree.actual
pred_random_tree['pred_sign'] = np.where(pred_random_tree['pred'] == 0, 'zer', np.where(pred_random_tree['pred'] > 0, 'pos', 'neg'))
pred_random_tree['correct_sign'] = (
pred_random_tree.actual_sign == pred_random_tree.pred_sign)
pred_random_tree['count'] = 1
# RMSE on whole test set
pred_random_tree.error.pow(2).mean() ** 0.5
# RMSE on positive data
pred_random_tree[pred_random_tree['actual'] > 0]['error'].pow(2).mean()**0.5
# RMSE on negative data
pred_random_tree[pred_random_tree['actual'] < 0]['error'].pow(2).mean()**0.5
# OLS
# Positive data
# Regressing on log(P22250)
# None of the non-zero observations of E01100 made it into training data, so pos_train['E01100']
# is just a column of zeroes, and is thus excluded.
# Predictors included are those with p values <= 0.1
ols_pos_predictors = [
'constant', 'DSI', 'EIC', 'XTOT', 'E00200', 'E00300',
'E00400', 'E00600', 'E01400', 'E18400', 'E18500', 'E19200',
'E19800_E20100', 'E20400', 'E32800'
]
ols_pos_fit = sm.OLS(pos_train['log_P22250'], pos_train[ols_pos_predictors]).fit()
# RMSE is massive
ols_pos_pred = np.exp(ols_pos_fit.predict(pos_test[ols_pos_predictors]))
(pos_test['P22250'] - ols_pos_pred).pow(2).mean()**0.5
# Regressing on log(P22250), using log-transformed continuous predictors
# Predictors included are those with p values <= 0.1
ols_logpos_predictors = [
'constant', 'DSI', 'MARS3', 'N24', 'log_E00200', 'log_E00300',
'log_E00400', 'log_E00600', 'log_E01500', 'log_E02400', 'log_E03210',
'log_E03270', 'log_E03300', 'log_E17500', 'log_E18400', 'log_E19200',
'log_E20400', 'log_E32800'
]
ols_logpos_fit = sm.OLS(pos_train['log_P22250'], pos_train[ols_logpos_predictors]).fit()
# RMSE is lower than random forest's positive predictions
ols_logpos_pred = np.exp(ols_logpos_fit.predict(pos_test[ols_logpos_predictors]))
(pos_test['P22250'] - ols_logpos_pred).pow(2).mean()**0.5
# Negative data
# Regressing on log(P22250)
# Predictors with p value <= 0.1
ols_neg_predictors = [
'constant', 'DSI', 'EIC', 'MARS3', 'E00200', 'E00300', 'E00400',
'E00600', 'E01400', 'E01500', 'E02100', 'E02400', 'E03210', 'E03270',
'E18400', 'E18500', 'E19200', 'E20400', 'E32800', 'F2441'
]
ols_neg_fit = sm.OLS(neg_train['log_P22250'], neg_train[ols_neg_predictors]).fit()
# RMSE is even larger than positive model regressing on non-log-transformed continuous predictors
ols_neg_pred = -np.exp(-ols_neg_fit.predict(neg_test[ols_neg_predictors]))
(neg_test['P22250'] - ols_neg_pred).pow(2).mean()**0.5
# Regressing on log(P22250), using log-transformed continuous predictors
# Predictors included are those with p values <= 0.1
ols_logneg_predictors = [
'constant', 'DSI', 'EIC', 'MARS3', 'XTOT', 'F2441', 'log_E00200',
'log_E00300', 'log_E00400', 'log_E00600', 'log_E01500', 'log_E01700',
'log_E03270', 'log_E03300', 'log_E17500', 'log_E18400', 'log_E19200',
'log_E19800_E20100', 'log_E20400'
]
ols_logneg_fit = sm.OLS(neg_train['log_P22250'], neg_train[ols_logneg_predictors]).fit()
# RMSE is again much better when regressing on log(predictors)
ols_logneg_pred = -np.exp(-ols_logneg_fit.predict(neg_test[ols_logneg_predictors]))
(neg_test['P22250'] - ols_logneg_pred).pow(2).mean()**0.5
Thanks Avi, does this chart represent the results correctly?
| Method | RMSE on full test data | RMSE on positive test data | RMSE on negative test data |
|---|---|---|---|
| RF random tree (not point estimate) | 886,796 | 1,227,639 | 999,927 |
| OLS log outcome, nonlog predictors | 19,746,052,052 | 890,112 | 23,007,880,474,031 |
| OLS log outcome, log predictors | Not produced | Not produced | 602,248 |
Can you use the RF predict function instead of the random tree? This will produce the point estimate, which would be apples-to-apples with the OLS models (it will have much better MSE as well). The analog for the random tree to OLS models would be a random point from each observation's predicted CDF. This will be useful for quantile loss, but not MSE as much.
I suggested in https://github.com/open-source-economics/taxdata/issues/267#issuecomment-408935003 building a full 2-stage model to compare against RF. I think that would be the most beneficial next step. At this point the only valid comparison is 886,796 to 19,746,052,052; segmenting this by the true label is only helpful if the pos/0/neg classifier is perfect.
I'd also recommend a different transformation to deal with zeros or negative values. The one in the notebook (np.where(x == 0, 0, np.sign(x) * np.log(abs(x)))) maps the same value for, say, 0.5 and -2 (-0.69), and maps zero above this (0). The simplest approach is adding the feature's minimum value plus one to the feature, then log-transform that. A more robust approach could be this.
Good idea to make a chart. I made some corrections and added cell references:
(these data are from this OLS vs RF notebook)
| Method | RMSE on full test data | RMSE on positive test data | RMSE on negative test data |
|---|---|---|---|
| RF random tree (not point estimate) | 886,796 (cell 9) | 1,227,639 (cell 10) | 999,927 (cell 11) |
| OLS log outcome, nonlog predictors | Not produced | 19,746,052,052 (cell 13) | 23,007,880,474,031 (cell 17) |
| OLS log outcome, log predictors | Not produced | 890,112 (cell 15) | 602,248 (cell 19) |
Regarding the point estimate vs. the random tree: yes, I can do that; that makes sense. Though, long-term, I understand the goal would be to use a random tree (if RF) or a random point from the CDF (if OLS) so that the imputation is stochastic, correct?
I suggested in #267 (comment) building a full 2-stage model to compare against RF. I think that would be the most beneficial next step.
OK, just to clarify, the best method of testing the branched approach would be to first ensure we're using the same imputation method as we do in RF (either point estimate, or random point/tree), then:
- Train logistic classifier on all the training data, use it to predict -/0/+ in testing data. Impute a 0 to the testing data predicted to be 0.
- Train OLS models on the negative/positive training data, use them to predict the continuous values of the testing data predicted to be negative/positive by the logistic classifier.
- Calculate RMSE (or whatever metric is used)
- Compare to RMSE of 1-stage RF
Regarding the logs: thanks for pointing that out! Here's what seems to be the core of the R code from the linked "robust method:"
z <- (x - min(x)) / (max(x) - min(x)) * 2 - 1
z <- z[-min(z)]
z <- z[-max(z)]
t <- atanh(z)
x would be the predictor column to be transformed, t would be the final transformed output. My question is, what are the middle two lines doing? They appear to squash z to a scalar when I test it in R, making t <- atanh(z) throw an exception.
Thank you very much for your detailed input
long-term I understand the goal would be to use a random tree (if RF) or random point from the CDF (if OLS) so that the imputation is stochastic, correct?
Yes, though since this will be challenging with the two-stage linear method, better to check that the MSEs are comparable first.
Your steps LGTM. I'd think about it as creating a new sort of predict function (you could call it something like predict_2stage), which takes a record and produces a prediction from the two-stage model. This will first assign -/0/+ based on a random number and the CDF, then a value within -/+ based on the OLS models. Unlike RF, even for calculating something like a point estimate, this is stochastic due to the -/0/+ labeling. Once you build this function, you can pass the test set and compare RMSE to RF.
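A hypothetical sketch of such a predict_2stage function (the names, the column ordering of the sign probabilities, and the inverse log transform are assumptions, not taken from the repo):

```python
import numpy as np

def predict_2stage(X_neg, X_pos, sign_probs, ols_neg, ols_pos, seed=0):
    """sign_probs: (n, 3) array of class probabilities, columns assumed ['neg', 'zer', 'pos'].
    ols_neg / ols_pos: fitted models for the negative / positive data;
    X_neg / X_pos: test design matrices with the corresponding predictor sets."""
    rng = np.random.default_rng(seed)
    u = rng.random(sign_probs.shape[0])
    cdf = sign_probs.cumsum(axis=1)
    label = (u[:, None] > cdf).sum(axis=1)        # 0 = 'neg', 1 = 'zer', 2 = 'pos'
    pred = np.zeros(sign_probs.shape[0])
    is_neg, is_pos = label == 0, label == 2
    # Invert the sign-preserving log transform used for the OLS responses.
    pred[is_neg] = -np.exp(-ols_neg.predict(X_neg[is_neg]))
    pred[is_pos] = np.exp(ols_pos.predict(X_pos[is_pos]))
    return pred
```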
You're right on the SO answer, which is erroneous. A correct approach would be something like atanh(((x - min(x)) / (max(x) - min(x)))) but this yields infinity for the minimum and maximum, so you'd want to add a buffer. For this data it might also make sense to make it symmetric to sit along zero.
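One possible sketch of that buffered, symmetric rescaling (the helper name and buffer size are hypothetical):

```python
import numpy as np

def buffered_atanh(x, buffer=1e-6):
    z = (x - x.min()) / (x.max() - x.min())   # rescale to [0, 1]
    z = z * (1 - 2 * buffer) + buffer         # shrink into (0, 1) so the endpoints stay finite
    return np.arctanh(2 * z - 1)              # map to (-1, 1), then apply atanh
```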
While this appears to have good statistical properties, I can't find a real name for it (it's not quite the Fisher transformation as the author suggests), so I'd probably just do the more common log(x + min(x) + 1).
This will first assign -/0/+ based on a random number and the CDF, then a value within -/+ based on the OLS models. Unlike RF, even for calculating something like a point estimate, this is stochastic due to the -/0/+ labeling.
If I understand you correctly, when I am testing a full 2-stage model using a stochastic sign prediction, the final prediction will be stochastic even if I use the point estimate from the OLS models rather than the get_prediction function?
so I'd probably just do the more common log(x + min(x) + 1).
OK, I will implement this and share the results.
Also, with regards to log-transforming our predictors, I understand that one indication that log transformation might be useful is if it makes your predictors normally distributed.
Here is a notebook where I compare the distributions of our continuous predictors before and after log-transforming them using log(x + sign(min(x))*min(x) + 1). I multiply min(x) by sign(min(x)) to ensure all the inputs to log() are positive: log predictors notebook.
(please disregard the repeated E02100 non-log/log charts; I don't know why they were copied)
This would indicate why our predictions are better when log-transforming some of our predictors, but also that other predictors might not benefit from being log-transformed. I would also be open to adding a residual analysis to this notebook, as I understand it is the residuals that may be more important to this question of log-transforming predictors.
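For reference, a minimal sketch of the shift-then-log transform used in that notebook (the helper name is hypothetical):

```python
import numpy as np

def shift_log(x):
    # log(x + sign(min(x)) * min(x) + 1): the shift guarantees every input to log() is positive.
    shift = np.sign(x.min()) * x.min() + 1
    return np.log(x + shift)
```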