
PUBDEV-8577: GLM: limit number of iterations when training the final model after CV.

Open · tomasfryda opened this issue 3 years ago · 1 comment

https://h2oai.atlassian.net/browse/PUBDEV-8577

This PR uses max(cv_model[i].iteration) + 1. Another option would be to use the average, but several GAM tests failed when the average was used.
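
A minimal sketch of that rule in Python (the `CVModel` type and `iterations` attribute are illustrative stand-ins, not the actual H2O-3 internals):

```python
from collections import namedtuple

# Illustrative stand-in for a trained CV model; in H2O-3 the iteration
# count would come from the real cross-validation model objects.
CVModel = namedtuple("CVModel", ["iterations"])

def main_model_iteration_cap(cv_models):
    """Cap for the main model: max(cv_model[i].iterations) + 1."""
    return max(m.iterations for m in cv_models) + 1

# e.g. three CV folds that converged after 12, 15, and 9 iterations
print(main_model_iteration_cap([CVModel(12), CVModel(15), CVModel(9)]))  # -> 16
```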

tomasfryda · Feb 14 '22 18:02

@tomasfryda Here are my thoughts on controlling GLM runtime.

Assume serial model building: if the runtime of GLM is proportional to the dataset size (say N) and we use NFold-fold CV, the run time breaks down as follows:

For the CV runs, each dataset has size N/NFold and there are NFold CV model runs, so the total run time is NFold * (N/NFold), which is proportional to N.

For the main model, the run time is proportional to N.

So the total run time of CV plus the main model is proportional to 2*N.

So it almost feels like we should allocate maxruntime/2 to building the CV models and maxruntime/2 to building the main model. In that case, you can restrict the main model to the number of iterations derived from the CV runs.

If we can build models in parallel, then it takes about N/NFold to build the CV models and N to build the main model, so the total runtime is proportional to N/NFold + N. In this case, you would allocate about 1/(NFold+1) * maxruntime to running all the CV models and NFold/(1+NFold) * maxruntime to running the main model, and you can even set the main model to run with max(iterations from all cv models).
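
A toy sketch of the two budget splits under the proportional-runtime assumption above (`budget_split`, `max_runtime`, and `nfolds` are illustrative names, not H2O-3 option names):

```python
def budget_split(max_runtime, nfolds, parallel=False):
    """Split max_runtime between the CV models and the main model.

    Serial CV:   CV work ~ N, main model ~ N  -> 50/50 split.
    Parallel CV: CV work ~ N/nfolds, main ~ N -> 1 : nfolds split.
    """
    cv_share = 1.0 / (nfolds + 1) if parallel else 0.5
    return cv_share * max_runtime, (1.0 - cv_share) * max_runtime

print(budget_split(600, nfolds=5))                 # serial:   (300.0, 300.0)
print(budget_split(600, nfolds=5, parallel=True))  # parallel: (100.0, 500.0)
```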

The reality is probably somewhere between building the CV models in serial and building them all in parallel.

Either way, I think your idea is sound.

wendycwong · Apr 27 '22 23:04

@wendycwong The comment you wrote was about functionality implemented in another PR, on which this PR was based.

That PR, which dealt with time allocation, is already merged.

This PR is just about ensuring that the main model won't use more iterations than the CV models. The functionality is similar to what happens in DeepLearning, but instead of the mean I use the max, to be safer.

My guess is that GLM is less prone to overfitting than DeepLearning, so using the max shouldn't cause overfitting issues (right now iterations are unbounded for the main model anyway), and it also shouldn't cause underfitting. Also, some tests failed when using the mean, and I don't want to break them, especially if they have fixed expected results (possibly obtained from R or another GLM implementation).
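
For comparison, a hedged sketch of the two choices (pure illustration; the actual DeepLearning averaging logic may differ in detail):

```python
cv_iterations = [12, 15, 9]  # iterations used by each CV model

# Mean-based cap (roughly the DeepLearning-style choice):
mean_cap = round(sum(cv_iterations) / len(cv_iterations))  # -> 12

# Max-based cap (this PR's choice for GLM, more conservative):
max_cap = max(cv_iterations) + 1                           # -> 16
```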

tomasfryda · Nov 30 '22 16:11

It is more conservative to use the maximum than the average.

wendycwong · Dec 11 '22 22:12