
Having trouble adding classifier metric + question on which metric should be used from existing ones

Open apavlo89 opened this issue 4 years ago • 17 comments

I'm trying to add a couple of binary classification metrics for TPOT. I have a small dataset and a binary classification problem where one class has about 40% more samples than the other. I guess you would call that a moderately unbalanced dataset? I care equally about both classes and I want the metric to be as stringent as possible. From the reading I've done, Cohen's kappa and the Matthews correlation coefficient are good metrics for binary classification, and log loss heavily penalizes the model when it is confident yet wrong.

I'm pretty much a novice when it comes to programming, but I've tried adding log loss, for example, using your guide, and I get the following error:

There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly. If you enabled PyTorch estimators, please check the data requirements in the online documentation: https://epistasislab.github.io/tpot/using/

I've written the following code:

# Make a custom metric function - log loss
from sklearn.metrics import log_loss
def my_custom_accuracy(y_true, y_pred):
    loss = log_loss(y_true, y_pred)
    return loss



# Make a custom scorer from the custom metric function
# Note: greater_is_better=False in make_scorer below would mean that the scoring function should be minimized.
from sklearn.metrics import make_scorer
my_custom_scorer = make_scorer(my_custom_accuracy, greater_is_better=True)

In any case, which metric do you recommend using from the existing TPOT ones?

Thank you for your help with this matter; I hope to hear from you soon. This is the final step of my analysis and I want to get it right... :/

apavlo89 avatar Oct 05 '20 13:10 apavlo89

You may use scoring="neg_log_loss" to use log loss in TPOT (check this link). By the way, this metric should have greater_is_better=False when using the make_scorer function.

Also, I suggest using scoring="balanced_accuracy" (check this link).
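
For example, both options look roughly like this (a minimal sketch with synthetic data; the small generations/population_size values and variable names are only for illustration):

from tpot import TPOTClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss, make_scorer

# Synthetic data purely for illustration
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Option 1: the built-in scorer string (the sign is handled for you)
tpot = TPOTClassifier(generations=5, population_size=20, scoring="neg_log_loss",
                      verbosity=2, random_state=42)

# Option 2: an equivalent custom scorer; log loss should be minimized,
# hence greater_is_better=False, and it needs predicted probabilities
my_neg_log_loss = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
tpot = TPOTClassifier(generations=5, population_size=20, scoring=my_neg_log_loss,
                      verbosity=2, random_state=42)

tpot.fit(X, y)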

weixuanfu avatar Oct 05 '20 14:10 weixuanfu

Thank you @weixuanfu, I didn't realise log loss is already included, great! If I just call it from the default TPOT installation, will greater_is_better be set to False automatically? I think I'm going to go with balanced_accuracy first, as it looks good for binary classification, and then log loss, and see which one gives better accuracy. Am I right to think that Cohen's kappa and the Matthews correlation coefficient are not included in TPOT?

apavlo89 avatar Oct 05 '20 14:10 apavlo89

If I just call it from the default TPOT installation, will greater_is_better be set to False automatically?

No, it should be set to False manually before using it in TPOT.

Am I right to think that Cohen's kappa and the Matthews correlation coefficient are not included in TPOT?

Yes, neither metric is included for use via a simple string in TPOT. You can use make_scorer with the metric function from scikit-learn to generate a scorer and then use it in TPOT.
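
For example, a minimal sketch of wrapping both metrics as scorers (names and TPOT settings here are illustrative):

from sklearn.metrics import cohen_kappa_score, matthews_corrcoef, make_scorer
from tpot import TPOTClassifier

# Both metrics take (y_true, y_pred) and are better when larger
kappa_scorer = make_scorer(cohen_kappa_score, greater_is_better=True)
mcc_scorer = make_scorer(matthews_corrcoef, greater_is_better=True)

# Pass the scorer object directly to the scoring parameter
tpot = TPOTClassifier(generations=5, population_size=20, scoring=mcc_scorer,
                      verbosity=2, random_state=42)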

weixuanfu avatar Oct 05 '20 14:10 weixuanfu

How would one go about calling neg_log_loss? I've set scoring='neg_log_loss' and declared greater_is_better=False, but I still get the same error.

apavlo89 avatar Oct 06 '20 09:10 apavlo89

Could you please provide your code with example data for reproducing this issue?

weixuanfu avatar Oct 06 '20 12:10 weixuanfu

Thank you for looking at this. I'm posting my code first to see whether something jumps out.

import pandas as pd
import numpy as np
from tpot import TPOTClassifier
import time

tic = time.perf_counter()
dataset = pd.read_csv('D:/data.csv')
dataset = dataset.drop(['Participant', 'AAT_1', 'AAT_DB'], axis=1)
dataset_list = list(dataset.columns)

AAT_GROUP_2_dict = {'LOW': 1,
                    'HIGH': 2}

dataset['AAT_GROUP_2'] = dataset.AAT_GROUP_2.map(AAT_GROUP_2_dict)

y = np.array(dataset['AAT_GROUP_2'])
X = pd.get_dummies(dataset)
X = X.drop(['AAT_GROUP_2'], axis=1)
X = np.array(X)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=16)

from sklearn.model_selection import LeaveOneOut
loocv = LeaveOneOut()

# log loss requirement ###########################
greater_is_better = False  # note: a bare variable assignment, not passed to TPOT
##################################################

tpot = TPOTClassifier(generations=500, population_size=100, verbosity=2, random_state=16,
                      early_stop=50, cv=loocv, scoring='neg_log_loss', n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

tpot.export('tpot_GROUP_2_pipeline.py')
toc = time.perf_counter()

print(f"Finished running the code in {toc - tic:0.4f} seconds")

apavlo89 avatar Oct 06 '20 16:10 apavlo89

I found the issue: neg_log_loss doesn't work with LOOCV. I'm not 100% sure why that is.

apavlo89 avatar Oct 06 '20 16:10 apavlo89

from tpot import TPOTClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import log_loss, make_scorer
import numpy as np

X, y = make_classification(n_samples=20, n_features=10, random_state=16)

loocv = LeaveOneOut()

# Pass the full set of class labels so log_loss can score one-sample test folds
my_neg_log_loss_scorer = make_scorer(log_loss, greater_is_better=False,
                                     needs_proba=True, labels=np.unique(y))

tpot = TPOTClassifier(generations=500, population_size=100, verbosity=2, random_state=16,
                      early_stop=50, cv=loocv, scoring=my_neg_log_loss_scorer, n_jobs=-1)
tpot.fit(X, y)
print(tpot.score(X, y))  # score on the training data; no separate test set in this demo

You need to use the labels parameter in the log_loss scoring function when using LOOCV, since only one sample is in the test set in each split of LOOCV. Please check the demo above.
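
To make the role of labels concrete, here is a minimal illustration (made-up probabilities) of what happens with LOOCV's one-sample test folds:

from sklearn.metrics import log_loss

y_true = [1]              # a single test sample, so only one class is present
y_proba = [[0.3, 0.7]]    # predicted probabilities for classes 0 and 1

# log_loss(y_true, y_proba)                      # would raise: y_true contains only one label
print(log_loss(y_true, y_proba, labels=[0, 1]))  # ~0.357, i.e. -log(0.7)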

weixuanfu avatar Oct 06 '20 17:10 weixuanfu

Fu, you are a rockstar! Thanks, it works great now. The number I am getting through the generations is negative (e.g., -0.15), but as it works through the generations it gets closer to 0, which I think means it is working fine. I'm still running the code and it's getting closer to 0. I will report back once my algorithmic babies reach the point of no further improvement.

apavlo89 avatar Oct 07 '20 19:10 apavlo89

Well, I got a low balanced_accuracy score from a log_loss of 0.053, and then I realised that I had included the X, y = make_classification(n_samples=20, n_features=263, random_state=16) line that you posted above, so I basically overwrote my dataset with a randomly generated one, lol. Re-running again...

One last thing: I also want to try running the Matthews correlation coefficient in TPOT.

Apparently it's better than an F1 score for binary classification? https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7

How would I go about implementing it? Is it this:

############# matthews corr coef #############################################
# Make a custom scorer from the metric function
from sklearn.metrics import matthews_corrcoef, make_scorer
my_matthews_corrcoef_scorer = make_scorer(matthews_corrcoef, greater_is_better=True)
###############################################################################

??

apavlo89 avatar Oct 08 '20 11:10 apavlo89

How would I go about implementing it? Is it this:

############# matthews corr coef #############################################
# Make a custom scorer from the metric function
from sklearn.metrics import matthews_corrcoef, make_scorer
my_matthews_corrcoef_scorer = make_scorer(matthews_corrcoef, greater_is_better=True)
###############################################################################

??

Looks good.

balanced_accuracy may not work well in your case, since the sample size is very small.

weixuanfu avatar Oct 08 '20 13:10 weixuanfu

@weixuanfu Great, I will run TPOT with it after the log loss run finishes. In your opinion, which accuracy metric would you use in my case? Just to reiterate: samples = 20, features = 263, the target is binary, and one class has more samples than the other (12 vs 8). Again, thank you for all your help.

apavlo89 avatar Oct 08 '20 14:10 apavlo89

You can simply use accuracy or the F1 score on all 20 samples used in TPOT. Note that this is a training score rather than a holdout/test score. I think it is hard to make train/test splits in your case.
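
For example, a minimal sketch of computing those training scores (it assumes the X, y, and fitted tpot objects from the earlier code):

from sklearn.metrics import accuracy_score, f1_score

# tpot was fitted on all 20 samples, so these are training scores, not test scores
y_pred = tpot.predict(X)
print("training accuracy:", accuracy_score(y, y_pred))
print("training F1:", f1_score(y, y_pred, pos_label=1))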

weixuanfu avatar Oct 08 '20 14:10 weixuanfu

You can simply use accuracy or the F1 score on all 20 samples used in TPOT. Note that this is a training score rather than a holdout/test score. I think it is hard to make train/test splits in your case.

Yes, I agree that train_test_split is definitely not good in my case; I realised this the day I posted this thread... I've been using the whole dataset (X, y) with LOOCV for a few days now. Am I right to think that training TPOT with log loss has a higher chance of yielding a better accuracy score (whether that is accuracy or F1), or is this faulty logic and I should just train TPOT with accuracy/F1 from the get-go?

apavlo89 avatar Oct 08 '20 15:10 apavlo89

Sorry, I am not familiar enough with log loss, so I am not sure whether that logic holds for this case. It also depends on your study objectives. If the F1 score is more important for your study, then you should use the F1 score.

weixuanfu avatar Oct 08 '20 15:10 weixuanfu

There is definitely something different about log loss that makes it better, or at least makes it seem better. Training TPOT with the accuracy metric gives me a maximum accuracy score of 95%. Training TPOT with log loss gives me a score of 100%, whether measured as accuracy, balanced accuracy, or f1_weighted. Training with log_loss also gives me a more complex pipeline; for example, it produced the following pipeline:

# Average CV score on the training set was: -9.992007221626413e-16 (explanation for this number after the code)
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, RobustScaler
from sklearn.svm import LinearSVC
from tpot.builtins import OneHotEncoder, StackingEstimator

exported_pipeline = make_pipeline(
    RobustScaler(),
    StackingEstimator(estimator=GradientBoostingClassifier(learning_rate=0.5, max_depth=9, max_features=0.8500000000000001, min_samples_leaf=1, min_samples_split=3, n_estimators=100, subsample=0.6000000000000001)),
    StackingEstimator(estimator=LinearSVC(C=10.0, dual=True, loss="hinge", penalty="l2", tol=1e-05)),
    StackingEstimator(estimator=LinearSVC(C=10.0, dual=True, loss="squared_hinge", penalty="l2", tol=0.001)),
    VarianceThreshold(threshold=0.1),
    OneHotEncoder(minimum_fraction=0.05, sparse=False, threshold=10),
    Normalizer(norm="max"),
    StackingEstimator(estimator=GradientBoostingClassifier(learning_rate=0.5, max_depth=9, max_features=0.8500000000000001, min_samples_leaf=1, min_samples_split=3, n_estimators=100, subsample=0.6000000000000001)),
    StackingEstimator(estimator=LinearSVC(C=0.5, dual=True, loss="squared_hinge", penalty="l2", tol=0.001)),
    VarianceThreshold(threshold=0.05),
    RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.7500000000000001, min_samples_leaf=1, min_samples_split=7, n_estimators=100)
)

The only weird thing is that the number is negative. Log loss values should be positive, with the lowest possible value being 0 (best). Generation 1 was -0.42, and once it reached -0.013, the subsequent generation gave a log loss score of -9.992007221626413e-16 instead of stopping at 0 (which should be the best log loss score possible).
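
As a quick sanity check (illustrative numbers only): TPOT maximizes the negated log loss here, so scores approach 0 from below, and -9.992007221626413e-16 is numerically indistinguishable from 0. A likely explanation for it not being exactly 0 is that scikit-learn's log_loss clips predicted probabilities slightly away from exactly 1, so even a pipeline that classifies every CV fold correctly with full confidence scores about -1e-15 rather than 0.

import numpy as np
from sklearn.metrics import log_loss

# A confident, correct prediction has a small positive log loss ...
print(log_loss([1], [[0.05, 0.95]], labels=[0, 1]))   # ~0.0513
# ... and TPOT reports the negated value, so it shows up as ~ -0.0513
print(-log_loss([1], [[0.05, 0.95]], labels=[0, 1]))
# The reported CV score is effectively a log loss of 0
print(np.isclose(-9.992007221626413e-16, 0.0))        # True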

What do you make of all of this? Is log_loss perhaps overfitting badly? Have I stumbled upon the discovery of the century? I don't know, LOL.

apavlo89 avatar Oct 10 '20 15:10 apavlo89

Another TPOT training run with log loss gave me this (the score got close to 0 and, once it flipped to -9.992007221626413e-16, I stopped TPOT):

from copy import copy

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, VarianceThreshold, f_classif
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from tpot.builtins import StackingEstimator
from xgboost import XGBClassifier

exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    VarianceThreshold(threshold=0.001),
    StackingEstimator(estimator=SGDClassifier(alpha=0.01, eta0=0.01, fit_intercept=True, l1_ratio=0.75, learning_rate="constant", loss="squared_hinge", penalty="elasticnet", power_t=0.0)),
    StackingEstimator(estimator=ExtraTreesClassifier(bootstrap=True, criterion="entropy", max_features=0.6000000000000001, min_samples_leaf=10, min_samples_split=13, n_estimators=100)),
    StandardScaler(),
    SelectPercentile(score_func=f_classif, percentile=2),
    StackingEstimator(estimator=DecisionTreeClassifier(criterion="gini", max_depth=2, min_samples_leaf=1, min_samples_split=6)),
    StackingEstimator(estimator=XGBClassifier(learning_rate=0.001, max_depth=8, min_child_weight=5, n_estimators=100, nthread=1, subsample=0.3)),
    RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.9500000000000001, min_samples_leaf=1, min_samples_split=5, n_estimators=100)
)

Again, this gives me 100% accuracy. I'm interested to hear your thoughts on this.

I also have a question about the order of the pipeline.

What was essentially done was the following:

  1. Firstly, preprocessing was done: StandardScaler()

  2. Then the following feature selection steps were done: SelectPercentile(score_func=f_classif, percentile=2), VarianceThreshold(threshold=0.001)

  3. Then the following stacking estimators were used:

StackingEstimator(estimator=SGDClassifier(alpha=0.01, eta0=0.01, fit_intercept=True, l1_ratio=0.75, learning_rate="constant", loss="squared_hinge", penalty="elasticnet", power_t=0.0)),
StackingEstimator(estimator=ExtraTreesClassifier(bootstrap=True, criterion="entropy", max_features=0.6000000000000001, min_samples_leaf=10, min_samples_split=13, n_estimators=100)),
StackingEstimator(estimator=DecisionTreeClassifier(criterion="gini", max_depth=2, min_samples_leaf=1, min_samples_split=6)),
StackingEstimator(estimator=XGBClassifier(learning_rate=0.001, max_depth=8, min_child_weight=5, n_estimators=100, nthread=1, subsample=0.3))

  4. Based on the predictions of the 4 estimators, RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.9500000000000001, min_samples_leaf=1, min_samples_split=5, n_estimators=100) was used to make the final classification decision.

Is this correct?

apavlo89 avatar Oct 11 '20 16:10 apavlo89