dmba
gainsChart() should include a measure of the total actual value
I may be mistaken, but the gainsChart() reference line for a random draw should be based on the total of the actual values for prediction (or the total number of actual occurrences for classification), not the total of the predicted values:
nActual = gains.sum() # number of desired records
"gains" is the list of predicted values.
The issue is not with the function itself. The problem is that the code for Figure 5.2 on page 132 passes the predicted values instead of the actual values (sorted by the predicted values):
pred_v = pd.Series(reg.predict(valid_X))
pred_v = pred_v.sort_values(ascending=False)
When passed to gainsChart(), pred_v needs to hold the actual prices, sorted by these predictions.
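To make the difference concrete, here is a minimal sketch in plain Python (no dmba or pandas). The data and the helper `cumulative_gains` are hypothetical and only illustrate the point; this is not the dmba implementation. It shows that the endpoint of the random-draw reference line is the total of whatever series is passed in, so passing sorted predictions yields the total predicted value rather than the total actual value.

```python
# Hypothetical toy data: actual prices and model predictions for five records.
actual = [10.0, 30.0, 20.0, 50.0, 40.0]
predicted = [12.0, 25.0, 27.0, 45.0, 36.0]


def cumulative_gains(values):
    """Return the running total of values in the order given (illustrative helper)."""
    total = 0.0
    gains = []
    for v in values:
        total += v
        gains.append(total)
    return gains


# Order the record indices by descending prediction.
order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)

# Intended input: the actual values, ordered by the predictions.
actual_by_pred = [actual[i] for i in order]
# Problematic input: the sorted predictions themselves.
pred_sorted = [predicted[i] for i in order]

gains_correct = cumulative_gains(actual_by_pred)
gains_wrong = cumulative_gains(pred_sorted)

# The random-draw reference line ends at the total of the series passed in.
baseline_total_correct = gains_correct[-1]  # total of the actual values
baseline_total_wrong = gains_wrong[-1]      # total of the predicted values

print(gains_correct, baseline_total_correct)
print(gains_wrong, baseline_total_wrong)
```

With the toy numbers above the two totals differ, which is why the reference line in the original figure was off.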
Hello Matt,
You are right. We identified this issue in the book about a year ago and (hopefully) fixed it with the following code. I can see that Wiley hasn't corrected the problem in the electronic version yet, and it has not yet been corrected in the code available through the book's website either.
Code for Figure 5.2:
# sort the actual values in descending order of the prediction
df = pd.DataFrame({
    'predicted': reg.predict(valid_X),
    'actual': valid_y,
})
df = df.sort_values(by=['predicted'], ascending=False)
fig, axes = plt.subplots(nrows=1, ncols=2)
ax = gainsChart(df['actual'], ax=axes[0])
ax.set_ylabel('Cumulative Price')
ax.set_title('Cumulative Gains Chart')
ax = liftChart(df['actual'], ax=axes[1], labelBars=False)
ax.set_ylabel('Lift')
plt.tight_layout()
plt.show()
The code for Figure 10.3 also needs correcting:
df = logit_result.sort_values(by=['p(1)'], ascending=False)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
gainsChart(df.actual, ax=axes[0])
liftChart(df.actual, title=False, ax=axes[1])
plt.tight_layout()
plt.show()
In this case, the call to liftChart was incorrect.
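For the classification case, the same idea can be sketched in plain Python: sort the records by predicted probability, keep the actual 0/1 classes, and compare each group's response rate with the overall rate. The data and the helper `group_lift` below are hypothetical and are not taken from dmba; the sketch also assumes the number of records divides evenly into the groups.

```python
# Hypothetical toy data: (actual class, predicted probability p(1)) for ten records.
records = [
    (1, 0.95), (1, 0.90), (0, 0.80), (1, 0.70), (0, 0.60),
    (0, 0.40), (1, 0.35), (0, 0.20), (0, 0.10), (0, 0.05),
]

# Sort by predicted probability, highest first, and keep only the actual classes.
actual_sorted = [a for a, p in sorted(records, key=lambda r: r[1], reverse=True)]


def group_lift(values, n_groups):
    """Mean of each consecutive group divided by the overall mean (illustrative helper)."""
    overall = sum(values) / len(values)
    size = len(values) // n_groups  # assumes len(values) is divisible by n_groups
    return [
        (sum(values[g * size:(g + 1) * size]) / size) / overall
        for g in range(n_groups)
    ]


lift = group_lift(actual_sorted, n_groups=5)
n_actual = sum(actual_sorted)  # number of actual occurrences of class 1

print(n_actual, lift)
```

Because the input is the actual class rather than the predicted probability, the total counts real occurrences, and a lift above 1 in the first groups means the model ranks true positives ahead of a random draw.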