isds2020 Inverted validation curve

Open: aabk-bkaa opened this issue 5 years ago • 1 comment

After fitting our model, it appears that our validation curve is inverted: the validation RMSE is systematically lower than the training RMSE, which does not make intuitive sense to us.

The modelling was produced with the following code:

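`
# Imports are not shown in the original post; `mse` is assumed to be an
# alias for sklearn's mean_squared_error.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from tqdm import tqdm

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=1)

lambdas = np.logspace(0, 8, 12)

folds = KFold(n_splits=5)
MSE_list = []

for _lambda in tqdm(lambdas):
    pipe_preproc = make_pipeline(PolynomialFeatures(2), StandardScaler(), Lasso(alpha=_lambda, max_iter=1000))
    MSE_train = []
    MSE_list_intermediate = []

    for train_index, val_index in tqdm(folds.split(X_train, y_train)):

        X_tr, y_tr = X_train.iloc[train_index], y_train.iloc[train_index]
        X_val, y_val = X_train.iloc[val_index], y_train.iloc[val_index]

        # RMSE on the held-out fold
        MSE_list_intermediate.append(mse(y_val, pipe_preproc.fit(X_tr, y_tr).predict(X_val))**(1/2))

        # RMSE on all of X_train (which includes the held-out fold), after refitting on the training fold
        MSE_train.append(mse(y_train, pipe_preproc.fit(X_tr, y_tr).predict(X_train))**(1/2))

    # Row: lambda, per-fold validation RMSE, mean validation RMSE, mean training RMSE
    MSE_list.append([_lambda] + MSE_list_intermediate + [np.mean(MSE_list_intermediate)] + [np.mean(MSE_train)])

MSE = pd.DataFrame(MSE_list)
MSE.columns = ["Lambda", "Fold 1", "Fold 2", "Fold 3", "Fold 4", "Fold 5", "Mean_RMSE", "Mean_RMSE_Evaluation"]

MSE.to_excel("LASSO_output.xlsx")
`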

Can anybody help us?

Kind regards, Anton and Søren

Aug 25 '20 08:08 aabk-bkaa

Hi @aabk-bkaa, assuming that you did not label the curves incorrectly when plotting the data, there could be other reasons for the RMSE being lower on the validation data than on the training data. See: https://stats.stackexchange.com/questions/187335/validation-error-less-than-training-error

Aug 25 '20 09:08 jsr-p
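One way to rule out a labelling mix-up is to let scikit-learn report matched training and validation scores for each fold. The following is a minimal sketch, not the code from the thread: the data is a synthetic stand-in generated with `make_regression`, the lambda grid and `max_iter` value are illustrative, and the names `results`, `train_rmse` and `val_rmse` are made up for this example. It uses `cross_validate` with `return_train_score=True`, so the training RMSE is computed only on the rows each model was fitted on.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): the thread's X and y are not shown.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

lambdas = np.logspace(-2, 2, 5)  # illustrative grid, not the one from the thread
folds = KFold(n_splits=5)

results = []
for _lambda in lambdas:
    pipe = make_pipeline(PolynomialFeatures(2), StandardScaler(),
                         Lasso(alpha=_lambda, max_iter=10000))
    # The scorer returns negative RMSE, so the sign is flipped below.
    scores = cross_validate(pipe, X, y, cv=folds,
                            scoring="neg_root_mean_squared_error",
                            return_train_score=True)
    results.append({
        "lambda": _lambda,
        "train_rmse": -scores["train_score"].mean(),  # scored on each training fold
        "val_rmse": -scores["test_score"].mean(),     # scored on each held-out fold
    })

for row in results:
    print(row)
```

Because each dictionary key names the data the score was computed on, the two curves cannot be swapped by accident when plotting. If the validation RMSE is still lower than the training RMSE with this setup, the causes discussed in the linked Stack Exchange question are the next thing to check.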
