isds2020
Inverted validation curve
After fitting our model, it appears that our validation curve is inverted:

The validation RMSE is systematically lower than the training RMSE, which does not make intuitive sense to us.
The results were produced with the following code:
`
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=1)
lambdas = np.logspace(0, 8, 12)
folds = KFold(n_splits=5)
MSE_list = []
for _lambda in tqdm(lambdas):
    pipe_preproc = make_pipeline(PolynomialFeatures(2), StandardScaler(), Lasso(alpha=_lambda, max_iter=1000))
    MSE_train = []
    MSE_list_intermediate = []
    for train_index, val_index in tqdm(folds.split(X_train, y_train)):
        X_tr, y_tr = X_train.iloc[train_index], y_train.iloc[train_index]
        X_val, y_val = X_train.iloc[val_index], y_train.iloc[val_index]
        MSE_list_intermediate.append(mse(y_val, pipe_preproc.fit(X_tr, y_tr).predict(X_val))**(1/2))
        MSE_train.append(mse(y_train, pipe_preproc.fit(X_tr, y_tr).predict(X_train))**(1/2))
    MSE_list.append([_lambda] + MSE_list_intermediate + [np.mean(MSE_list_intermediate)] + [np.mean(MSE_train)])
MSE = pd.DataFrame(MSE_list)
MSE.columns = ["Lambda", "Fold 1", "Fold 2", "Fold 3", "Fold 4", "Fold 5", "Mean_RMSE", "Mean_RMSE_Evaluation"]
MSE.to_excel("LASSO_output.xlsx")
`
Can anybody help us?
Kind regards, Anton and Søren
Hi @aabk-bkaa, assuming that the curves were not mislabelled when you plotted the data, there are other reasons why the RMSE can be lower on the validation data than on the training data. See: https://stats.stackexchange.com/questions/187335/validation-error-less-than-training-error
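Three details in the posted loop may also be worth checking, although none of them is certain to explain the inversion. The training RMSE is computed by predicting on all of `X_train`, which includes the validation fold, rather than on `X_tr`. The column named `Mean_RMSE_Evaluation` holds the mean of `MSE_train`, while `Mean_RMSE` holds the validation mean, so the two are easy to swap when plotting. And the pipeline is fitted twice per fold. Below is a minimal sketch of the same loop that fits once per fold and scores the training error only on the rows the model was fitted on. The helper names `rmse` and `cv_rmse_curve` and the synthetic demo data are hypothetical and not from the original post.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical helper)."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def cv_rmse_curve(X, y, lambdas, n_splits=5):
    """Return one row (lambda, mean train RMSE, mean validation RMSE) per lambda."""
    X, y = np.asarray(X), np.asarray(y)
    folds = KFold(n_splits=n_splits)
    rows = []
    for lam in lambdas:
        train_scores, val_scores = [], []
        for train_index, val_index in folds.split(X):
            pipe = make_pipeline(
                PolynomialFeatures(2),
                StandardScaler(),
                Lasso(alpha=lam, max_iter=1000),
            )
            # Fit once per fold, on the training part only.
            pipe.fit(X[train_index], y[train_index])
            # Training error: scored on the rows the model was fitted on.
            train_scores.append(rmse(y[train_index], pipe.predict(X[train_index])))
            # Validation error: scored on the held-out rows.
            val_scores.append(rmse(y[val_index], pipe.predict(X[val_index])))
        rows.append((lam, float(np.mean(train_scores)), float(np.mean(val_scores))))
    return rows


if __name__ == "__main__":
    # Synthetic data, used only to show the call; not the data from the post.
    rng = np.random.default_rng(1)
    X_demo = rng.normal(size=(200, 3))
    y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
    for lam, train_rmse, val_rmse in cv_rmse_curve(X_demo, y_demo, np.logspace(-3, 1, 5)):
        print(f"lambda={lam:g}  train RMSE={train_rmse:.3f}  validation RMSE={val_rmse:.3f}")
```

If the curves still look inverted after these changes, the explanations in the linked thread are the next thing to look at.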