
TPOT internal cross validation is not shuffled - can bias results if not accounted for.

Open perib opened this issue 2 years ago • 0 comments

TPOT's internal cross validation is not shuffled. Rather, the data is split into sequential chunks in the order in which it was passed in (e.g. indexes 1-10 are chunk 1, 11-20 are chunk 2, 21-30 are chunk 3, etc.). This could lead to biased results if the data was ordered in a particular way before being given to TPOT, which is common (for example, rows sorted by label or by collection date). TPOT's documentation also does not mention this behavior, so users may not know to shuffle their data before passing it into TPOT.
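To make the sequential chunking concrete, here is a minimal sketch using scikit-learn's KFold directly rather than TPOT itself. The 30-sample array is hypothetical toy data, and sklearn reports 0-based indexes, so the chunks appear as 0-9, 10-19, 20-29.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 30 samples, kept in the order they were "passed in".
X = np.arange(30).reshape(-1, 1)

# shuffle defaults to False, so each test fold is a contiguous block.
cv = KFold(n_splits=3)
test_folds = [test_idx.tolist() for _, test_idx in cv.split(X)]

for i, fold in enumerate(test_folds, start=1):
    print(f"chunk {i}: indexes {fold[0]}-{fold[-1]}")
```

If the rows were sorted by label, each test fold would then be dominated by whichever labels happen to fall in that block.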

TPOT gets its cross validation loop from the check_cv function from sklearn in line 1507 of base.py.

This returns either a StratifiedKFold or KFold class (return statement here)

By default, both have shuffle set to False. Documentation for StratifiedKFold is here, and for KFold here.
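The behavior of check_cv can be verified outside of TPOT. The following is a small sketch that calls sklearn's check_cv on made-up targets (the arrays and the fold count of 5 are assumptions for illustration) and inspects the returned splitter.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Hypothetical targets: a binary classification target and a continuous one.
y_class = np.array([0] * 10 + [1] * 10)
y_reg = np.linspace(0.0, 1.0, 20)

# For an integer cv, check_cv picks StratifiedKFold for classifiers with a
# binary or multiclass target, and KFold otherwise.
cv_clf = check_cv(5, y_class, classifier=True)
cv_reg = check_cv(5, y_reg, classifier=False)

print(type(cv_clf).__name__, cv_clf.shuffle, cv_clf.random_state)
print(type(cv_reg).__name__, cv_reg.shuffle, cv_reg.random_state)
```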

Possible solutions

Shuffle and random state could be set after defining the cv instance.

cv.shuffle = True
cv.random_state = self.random_state
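As a rough check that setting these attributes after construction has the intended effect, the sketch below patches the splitter returned by check_cv and compares it against a KFold built with shuffle=True from the start. The seed value stands in for self.random_state and is an assumption, as is the toy data. Note that assigning attributes this way skips the constructor's argument validation, so building the splitter explicitly may be the cleaner option.

```python
import numpy as np
from sklearn.model_selection import KFold, check_cv

X = np.arange(30).reshape(-1, 1)  # hypothetical toy data
seed = 42                         # stands in for self.random_state

# With no y and classifier=False, check_cv(3) returns KFold(n_splits=3).
cv = check_cv(3)
cv.shuffle = True
cv.random_state = seed
patched_folds = [test.tolist() for _, test in cv.split(X)]

# Equivalent splitter constructed with shuffling enabled up front.
explicit = KFold(n_splits=3, shuffle=True, random_state=seed)
explicit_folds = [test.tolist() for _, test in explicit.split(X)]

print(patched_folds == explicit_folds)
```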

This could also be a user-set parameter of TPOT's constructor for clarity.

TPOT's documentation should mention this so that users understand how to prepare their data before passing it into TPOT.
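Until this is addressed, one possible user-side workaround is to shuffle the data before fitting. The sketch below uses sklearn.utils.shuffle on hypothetical arrays sorted by class; the shuffled arrays would then be passed to TPOT in place of the originals.

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical data sorted by class: first five rows are class 0, last five class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# Shuffle features and labels together so each row keeps its own label.
X_shuffled, y_shuffled = shuffle(X, y, random_state=0)

print(y_shuffled)
# X_shuffled and y_shuffled would be given to TPOT's fit instead of X and y.
```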

perib · May 17 '22 21:05