h2o-3
PUBDEV-8577: GLM: limit number of iterations when training the final model after CV.
https://h2oai.atlassian.net/browse/PUBDEV-8577
This PR uses max(cv_model[i].iteration) + 1. Another option would be to use the average, but several GAM tests failed when the average was used.
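The cap described above can be sketched as a one-liner. This is an illustrative sketch only, not the actual h2o-3 internals; the function name and the list-of-iteration-counts input are assumptions for the example.

```python
def main_model_iteration_cap(cv_iterations):
    """Derive the main model's iteration limit from the CV submodels.

    cv_iterations: list of iteration counts, one per CV fold model.
    Using max (rather than the mean) is the conservative choice discussed
    in this PR; the +1 gives the main model a little slack.
    """
    return max(cv_iterations) + 1

print(main_model_iteration_cap([12, 15, 9]))  # → 16
```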
@tomasfryda Here are my thoughts on controlling runtime for GLM.
Assume serial model building: if the runtime of GLM is proportional to the dataset size (say N) and we allow NFold-fold CV, then the runtime breaks down as follows:
For the CV runs, the dataset size is N/NFold and the number of CV models is NFold, so the total CV runtime is NFold * (N/NFold), i.e. proportional to N.
For the main model, the runtime is proportional to N.
The total runtime of doing CV is therefore proportional to 2*N.
So it almost feels like we should allocate maxruntime/2 to building the CV models and maxruntime/2 to building the main model. In this case, you can restrict the main model to run with iterations derived from the CV runs.
If we can have parallel model building, then it takes about N/NFold to build the CV models and N to build the main model, so the total runtime is proportional to N/NFold + N. In this case, you would allocate about 1/(NFold+1) * maxruntime to running all the CV models and NFold/(NFold+1) * maxruntime to running the main model, and you can even set the main model to run with max(iterations from all cv models).
The reality is probably somewhere in between building cv models in serial and building all cv models in parallel.
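The two budget splits above can be checked with a small sketch. The function name and signature are illustrative, not part of h2o-3; the only assumption carried over from the discussion is that GLM runtime is proportional to dataset size.

```python
def runtime_budget(max_runtime, nfolds, parallel=False):
    """Split max_runtime between the CV models and the main model.

    Serial case: total CV work is NFold * (N/NFold) = N, same as the main
    model, so each side gets half the budget.
    Parallel case: CV work is N/NFold vs. N for the main model, so the CV
    share is 1/(NFold+1) and the main model gets NFold/(NFold+1).
    Returns (cv_budget, main_model_budget).
    """
    if parallel:
        cv_budget = max_runtime / (nfolds + 1)
    else:
        cv_budget = max_runtime / 2
    return cv_budget, max_runtime - cv_budget

print(runtime_budget(60.0, 5))                 # → (30.0, 30.0)
print(runtime_budget(60.0, 5, parallel=True))  # → (10.0, 50.0)
```

As the thread notes, real deployments sit somewhere between these two extremes, so the serial split is the safer default.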
Either way, I think your idea is sound.
@wendycwong The comment you wrote was about functionality implemented in another PR, on which this PR was based.
The former, which dealt with time allocation, is already merged.
This PR is just about ensuring that the main model won't use more iterations than the CV models. The functionality is similar to what happens in DeepLearning, but instead of the mean I use the max, to be safer.
My guess is that GLM is less prone to overfitting than DeepLearning, so using the max shouldn't cause issues with overfitting (also, iterations are currently unbounded for the main model), and it shouldn't cause underfitting either. In addition, some tests failed when using the mean, and I don't want to break them, especially if they have fixed expected results (possibly obtained from R or another implementation of GLM).
It is more conservative to use the maximum than the average.