Imbalanced Dataset vs Cost Function
Hi,
@albahnsen Do you have any thoughts on the relationship between the cost matrix and re-balancing the data? I notice that you do not rebalance in your final logistic regression model in your wonderful paper.
If I have a highly imbalanced dataset where fewer than 1% of examples are Positive and over 99% are Negative, but the theoretical cost is 30 if everything is labelled positive and 1 if everything is labelled negative, what should I be adjusting to stop it predicting all Positive? The imbalance? The costs? The iterations?
Thanks!
Edit: I've done some cross-validation to check different C and max_iter, but it seems like the best savings score I can get is 0 (with the worst being -12).
Hi.
If you're assuming a constant cost between errors, balancing the input dataset is equivalent to adjusting the decision threshold using the costs. However, if the costs are example-dependent, balancing the dataset does not give you optimal results.
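To make that distinction concrete, here is a minimal numpy sketch (my own illustration, not code from the library; the cost-matrix column order `[FP, FN, TP, TN]` and the function name are assumptions) of adjusting the threshold using the costs:

```python
import numpy as np

def cost_sensitive_predict(prob_pos, cost_mat):
    """Predict positive when the expected cost of a positive label is lower.

    cost_mat columns are assumed to be [C_FP, C_FN, C_TP, C_TN],
    one row per example.
    """
    c_fp, c_fn, c_tp, c_tn = cost_mat.T
    # Expected cost of predicting positive:  p*C_TP + (1-p)*C_FP
    # Expected cost of predicting negative:  p*C_FN + (1-p)*C_TN
    # Solving for p gives a per-example threshold:
    threshold = (c_fp - c_tn) / (c_fn - c_tn + c_fp - c_tp)
    return (prob_pos >= threshold).astype(int)

# Constant costs give a single global threshold (same effect as rebalancing);
# example-dependent costs give a different threshold per example, which
# rebalancing alone cannot reproduce.
constant_costs = np.array([[1.0, 30.0, 0.0, 0.0]] * 3)
print(cost_sensitive_predict(np.array([0.01, 0.04, 0.5]), constant_costs))
```

With constant costs the threshold collapses to 1/31 for every example, which is why simple rebalancing can mimic it; with per-row costs it cannot.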
> Edit: I've done some cross-validation to check different C and max_iter, but it seems like the best savings score I can get is 0 (with the worst being -12).

That looks quite suspicious.
Thanks for the response @albahnsen! :) Can I confirm that the columns of the cost matrix are: false positives, false negatives, true positives and true negatives?
When I print out the model history (view of the iterations), it suggests the cost per example for the best model is $0.805161. However, when I manually compute the savings score with `cost, cost_base, savings_p = savings_score(y_vec, train_predictions, cost_mat)`, the cost per example is much higher (around the cost per alert) and the model predicts everything as fraud.
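For cross-checking those numbers, a hand-rolled savings computation can help (my own numpy sketch, not costcla's implementation; it assumes cost_mat columns `[FP, FN, TP, TN]`). It also shows why a best score of exactly 0 is a telltale sign: predicting everything as the cheaper of the two trivial classes yields a savings of exactly 0.

```python
import numpy as np

def total_cost(y, pred, cost_mat):
    # Sum per-example costs; columns assumed to be [C_FP, C_FN, C_TP, C_TN].
    c_fp, c_fn, c_tp, c_tn = cost_mat.T
    return np.sum(y * (pred * c_tp + (1 - pred) * c_fn)
                  + (1 - y) * (pred * c_fp + (1 - pred) * c_tn))

def savings(y, pred, cost_mat):
    # Savings relative to the cheaper of the two trivial policies
    # (label everything negative vs. label everything positive).
    cost = total_cost(y, pred, cost_mat)
    cost_base = min(total_cost(y, np.zeros_like(y), cost_mat),
                    total_cost(y, np.ones_like(y), cost_mat))
    return 1.0 - cost / cost_base
```

So if the model collapses to all-positive (or all-negative) and that happens to be the cheaper trivial policy, savings comes out at exactly 0, which matches what I'm seeing.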
C = 1.0, i.e. essentially no regularization, because I was suspicious about the loss function.
The cost matrix and loss function appear fine, so the problem is with the optimisation of the function.
Now, the reason I suggested downsampling is that whereas you had 0.5% true fraud in your example, mine is more like 0.05% or worse. =( As a result the optimisation tends to converge to predicting a single class, which is not ideal.
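For reference, the kind of downsampling I have in mind looks roughly like this (a hypothetical numpy sketch of undersampling the majority class; the function name and `ratio` parameter are my own):

```python
import numpy as np

def undersample_negatives(X, y, ratio=1.0, seed=0):
    """Keep all positives; randomly keep ratio * n_pos negatives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=int(ratio * len(pos)), replace=False)
    idx = np.sort(np.concatenate([pos, keep_neg]))
    return X[idx], y[idx]
```

Though, per your earlier point, with example-dependent costs this kind of balancing would not be optimal anyway.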