CostSensitiveClassification

Imbalanced Dataset vs Cost Function

Open S-C-H opened this issue 5 years ago • 3 comments

Hi,

@albahnsen Do you have any thoughts on the relationship between the cost matrix and re-balancing the data? I notice that you do not rebalance in your final logistic regression model in your wonderful paper.

If I have a highly imbalanced dataset where <1% are Positive and 99% are Negative,

but the theoretical cost is 30 if all are labelled positive and 1 if all are labelled negative,

What should I be adjusting to stop it predicting all Positive? The imbalance? The cost? The iterations?

Thanks!

Edit: I've done some Cross Validation to check different C and max_iter, but it seems the best savings score I can get is 0 (with the worst being -12).

S-C-H avatar May 24 '20 22:05 S-C-H

Hi.

If you assume a constant cost for each type of error, balancing the input dataset is equivalent to adjusting the decision threshold using the costs. However, if the costs are example-dependent, balancing the dataset does not give you optimal results.
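A small sketch of the constant-cost case described above (the function name and the numbers are illustrative, not from the library or the paper): when C_TP = C_TN = 0, predicting positive costs (1 - p) * C_FP in expectation and predicting negative costs p * C_FN, so the cost-optimal threshold follows directly from the two costs, and moving the threshold plays the same role as rebalancing.

```python
# Hedged sketch: constant misclassification costs, correct predictions free.

def optimal_threshold(c_fp, c_fn):
    """Predict positive when P(y=1|x) exceeds this threshold.

    Expected cost of predicting positive: (1 - p) * c_fp
    Expected cost of predicting negative: p * c_fn
    Positive is cheaper whenever p > c_fp / (c_fp + c_fn).
    """
    return c_fp / (c_fp + c_fn)

# Hypothetical costs: a missed fraud (FN) is 30x the cost of an alert (FP).
t = optimal_threshold(c_fp=1.0, c_fn=30.0)
print(round(t, 4))  # 0.0323 -- far below the default 0.5
```

With such a low threshold, an unadjusted model that uses 0.5 behaves very differently from the cost-optimal one, which is one way a model can drift toward a single class.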

> Edit: I've done some Cross Validation to check different C and max_iter, but it seems the best savings score I can get is 0 (with the worst being -12).

That looks quite suspicious.

albahnsen avatar May 27 '20 22:05 albahnsen

Thanks for the response @albahnsen! :) Can I confirm that the columns of the cost matrix are: false positives, false negatives, true positives, and true negatives?

When I print out the model history (a view of the iterations), it suggests the cost per example for the best model is $0.805161. However, when I compute the savings score manually with `cost, cost_base, savings_p = savings_score(y_vec, train_predictions, cost_mat)`, the cost per example is much higher (roughly the cost per alert) and the model predicts all fraud.

C = 1.0, with no regularization, because I was suspicious about the loss function.
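To sanity-check the score, the savings metric from the Bahnsen et al. papers can be computed by hand: savings = (cost_base - cost) / cost_base, where cost_base is the cost of the cheaper trivial policy (predict all-negative or all-positive). The helper names below are my own sketch, not costcla's API:

```python
# Hedged sketch of the savings metric; cost_mat columns assumed to be
# [C_FP, C_FN, C_TP, C_TN], one row per example.
import numpy as np

def total_cost(y_true, y_pred, cost_mat):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cost = (y_pred * (y_true * cost_mat[:, 2] + (1 - y_true) * cost_mat[:, 0])
            + (1 - y_pred) * (y_true * cost_mat[:, 1] + (1 - y_true) * cost_mat[:, 3]))
    return cost.sum()

def savings(y_true, y_pred, cost_mat):
    base = min(total_cost(y_true, np.zeros_like(y_true), cost_mat),
               total_cost(y_true, np.ones_like(y_true), cost_mat))
    return (base - total_cost(y_true, y_pred, cost_mat)) / base

y_true = np.array([1, 0, 0, 0])
cm = np.tile([1.0, 30.0, 0.0, 0.0], (4, 1))  # constant costs per row
print(savings(y_true, y_true, cm))                       # perfect predictions -> 1.0
print(savings(y_true, np.zeros(4, dtype=int), cm))       # trivial all-negative -> negative savings
```

A savings of 0 means the model is no better than the cheaper trivial policy, and a negative savings means it is worse, which matches the 0 and -12 scores mentioned above.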

S-C-H avatar Jun 15 '20 22:06 S-C-H

The cost matrix and loss function appear fine, so the problem is with the optimisation of the function.

Now, the reason I suggested downsampling is that whereas your example had 0.5% true fraud, mine is more like 0.05% or worse. =( As a result, the optimisation tends to converge to predicting a single class, which is not ideal.
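For what it's worth, a minimal sketch of the kind of downsampling described here (the function and ratio are my own illustration): keep every positive, randomly keep a fixed number of negatives per positive, and keep the per-example cost rows aligned with the sampled examples so the cost-sensitive training still sees the right costs.

```python
# Hedged sketch: random undersampling of the majority (negative) class,
# keeping X, y, and cost_mat index-aligned.
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, cost_mat, neg_per_pos=10):
    """Keep every positive and at most neg_per_pos negatives per positive."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = min(len(neg), neg_per_pos * len(pos))
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep], cost_mat[keep]

# Tiny demo: 2 positives in 1000 examples, keep 10 negatives per positive.
y = np.zeros(1000, dtype=int)
y[[3, 700]] = 1
X = np.arange(1000, dtype=float).reshape(-1, 1)
cm = np.tile([1.0, 30.0, 0.0, 0.0], (1000, 1))
Xs, ys, cms = undersample(X, y, cm)
print(len(ys), ys.sum())  # 22 2
```

Per the earlier comment, with example-dependent costs this is a workaround for the optimiser collapsing to one class rather than a substitute for the cost matrix itself.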

S-C-H avatar Jun 16 '20 22:06 S-C-H