pyoptree Prediction use wrong values in normalization

Open victormrsilva opened this issue 4 years ago • 1 comments

Hi! First of all, great work on implementing this!

I was checking your code and I think I caught an error in prediction. Because the number of elements in the leaves did not match when I checked the generated tree, I noticed that the predict function uses the max and min values of the training dataset instead of the test dataset.

def predict(self, data: pd.DataFrame):
    if not self.is_trained:
        raise ValueError("Model has not been trained yet! Please use `train()` to train the model first!")

    new_data = data.copy()
    new_data_cols = data.columns
    for col in self.P_range:
        if col not in new_data_cols:
            raise ValueError("Column {0} is not in the given data for prediction! ".format(col))
        col_max, col_min = self.normalizer[col]  # <====== this line

So I modified it to take the maximum and minimum values of the column from the test set:

        col_max = max(data[col])
        col_min = min(data[col])
        # col_max, col_min = self.normalizer[col]

I don't know if this was a mistake or intentional. If it was intentional, why is it this way?

Thanks and, again, great work!

Nov 26 '20 15:11 victormrsilva

Hi Victor,

Thanks for pointing this out! However, the prediction data needs to be normalized in the same way as the training data, so the col_max and col_min from the training data are used. That is why I used self.normalizer[col].
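To illustrate the point, here is a minimal sketch in plain Python (not pyoptree's actual code). The helper names `fit_min_max` and `normalize`, the sample values, and the split threshold of 0.5 are all assumptions made up for this example. It shows that a split learned on the normalized training scale only keeps its meaning if new data is scaled with the training min and max.

```python
def fit_min_max(values):
    """Return (col_max, col_min), in the same order as the snippet above."""
    return max(values), min(values)


def normalize(values, col_max, col_min):
    """Min-max scale values using the given statistics."""
    span = col_max - col_min
    if span == 0:
        # Constant column: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - col_min) / span for v in values]


# Hypothetical single-column data.
train = [0.0, 5.0, 10.0]
test = [6.0, 8.0, 10.0]

train_max, train_min = fit_min_max(train)

# Hypothetical split learned on normalized training data:
# "x < 0.5 goes left", which corresponds to raw values below 5.0.
threshold = 0.5

# Scale the test column with training statistics vs. its own statistics.
with_train_stats = normalize(test, train_max, train_min)
with_test_stats = normalize(test, *fit_min_max(test))

# Raw test values that would be routed to the left branch in each case.
left_train = [raw for raw, x in zip(test, with_train_stats) if x < threshold]
left_test = [raw for raw, x in zip(test, with_test_stats) if x < threshold]

print(left_train)
print(left_test)
```

With the training statistics, no test value falls below the split, which matches the raw threshold of 5.0. With the test set's own statistics, the raw value 6.0 is mapped to 0.0 and ends up on the left branch, even though it lies above the raw threshold.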

Thanks, Meng

Nov 30 '20 03:11 pan5431333
