gbm.step() doesn't iterate for large continuous response variables

Open aosmith16 opened this issue 4 years ago • 0 comments

The iteration loop in gbm.step() never starts for some continuous response variables with large values.

library(dismo)
data(Anguilla_train)
Anguilla_train = Anguilla_train[1:200,]

fitcont = gbm.step(data = Anguilla_train, gbm.x = c(3:5, 7:14), gbm.y = 6, family = "gaussian",
                   tree.complexity = 5, learning.rate = 0.01, bag.fraction = 0.5)

#>  GBM STEP - version 2.9 
#>  
#> Performing cross-validation optimisation of a boosted regression tree model 
#> for DSDist and using a family of gaussian 
#> Using 200 observations and 11 predictors 
#> creating 10 initial models of 50 trees 
#> 
#>  folds are unstratified 
#> total mean deviance =  8013.378 
#> tolerance is fixed at  8.0134 
#> ntrees resid. dev. 
#> 50    5300.797 
#> now adding trees...
 

#> mean total deviance = 8013.378 
#> mean residual deviance = 4755.488 
#>  
#> estimated cv deviance = 5300.796 ; se = 365.367 
#>  
#> training data correlation = 0.848 
#> cv correlation =  0.707 ; se = 0.075 
#>  
#> elapsed time -  0.01 minutes

I poked around a bit in gbm.step() and I believe the cause is the delta.deviance variable, which is part of the condition of the while() loop that adds trees in increments of the step size. This variable is hard-coded to 1 before the loop starts, which works great for family = "bernoulli" and for continuous variables with a smaller range.

For some continuous variables with a large range, the while loop condition delta.deviance > tolerance.test is already false on the first check, because delta.deviance is 1 and tolerance.test, calculated as mean.total.deviance * tolerance, is 1 or larger. In the example above tolerance.test is 8.0134, so the condition 1 > 8.0134 fails, the while loop never starts, and no rows beyond the initial 50 trees appear in the output.
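The failed first check can be reproduced with plain arithmetic. The sketch below is not the gbm.step() source: loop_starts() is a hypothetical helper, and tolerance = 0.001 is an assumption inferred from the printed values (8013.378 * 0.001 rounds to the reported 8.0134).

```r
# Hypothetical helper: would the first check of the while() condition pass?
loop_starts <- function(delta.deviance, mean.total.deviance, tolerance) {
  tolerance.test <- mean.total.deviance * tolerance
  delta.deviance > tolerance.test
}

# Values from the output above; tolerance = 0.001 is assumed
starts_large <- loop_starts(delta.deviance = 1,
                            mean.total.deviance = 8013.378,
                            tolerance = 0.001)

# Made-up small deviance, roughly the scale of a bernoulli model
starts_small <- loop_starts(delta.deviance = 1,
                            mean.total.deviance = 0.5,
                            tolerance = 0.001)

print(c(large = starts_large, small = starts_small))
```

With a starting value of 1, the check only passes when mean.total.deviance * tolerance is below 1.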

I tried changing the hard-coded delta.deviance from 1 to mean.total.deviance and things appeared to work fine for bernoulli and gaussian models. However, I don't know what other repercussions this has.
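To illustrate the effect of the starting value, here is a toy stand-in for the stepping loop. It is only a sketch: count_steps() is hypothetical, the cv.deviance series is made up, and the way delta.deviance is updated here is a simplification rather than what gbm.step() actually computes.

```r
# Toy stand-in for the stepping loop; NOT the gbm.step() source.
count_steps <- function(start.delta, cv.deviance, tolerance.test) {
  delta.deviance <- start.delta
  steps <- 0
  i <- 1
  while (delta.deviance > tolerance.test && i < length(cv.deviance)) {
    # Simplified update: improvement gained by the latest batch of trees
    delta.deviance <- cv.deviance[i] - cv.deviance[i + 1]
    steps <- steps + 1
    i <- i + 1
  }
  steps
}

mean.total.deviance <- 8013.378  # from the output above
tolerance.test <- 8.0134         # from the output above

# Made-up CV deviances after each successive batch of trees
cv.deviance <- c(5300, 4200, 3600, 3300, 3296, 3295)

# Starting value hard-coded to 1: the loop body never runs
steps_hardcoded <- count_steps(1, cv.deviance, tolerance.test)

# Starting value set to mean.total.deviance: the loop runs until the
# improvement drops below tolerance.test
steps_proposed <- count_steps(mean.total.deviance, cv.deviance, tolerance.test)

print(c(hardcoded = steps_hardcoded, proposed = steps_proposed))
```

Because mean.total.deviance is larger than mean.total.deviance * tolerance for any tolerance below 1 and any positive deviance, the first check passes regardless of the scale of the response.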

Another option to bypass this problem without changing the function is to set tolerance small enough that tolerance.test goes below 1 (but this may have other impacts), or to scale the response variable. If these are the best fixes, maybe they could be added as suggestions in the documentation?
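A rough sketch of why both workarounds get tolerance.test below 1, using base R only. The response values are made up, mean_dev() is a hypothetical helper, and it assumes the gaussian mean total deviance is the mean squared deviation of the response around its mean; tolerance = 0.001 is also an assumption.

```r
# Hypothetical helper: assumed form of the gaussian mean total deviance
mean_dev <- function(y) mean((y - mean(y))^2)

y <- c(10, 50, 120, 200, 300)  # made-up large-range response
tolerance <- 0.001             # assumed tolerance

# Workaround 1: scale the response before fitting
tol_raw <- mean_dev(y) * tolerance
y_scaled <- as.numeric(scale(y))
tol_scaled <- mean_dev(y_scaled) * tolerance

# Workaround 2: pick a tolerance so that tolerance.test < 1.
# For the deviance reported above, tolerance must be below this value:
max_tolerance <- 1 / 8013.378

print(c(raw = tol_raw, scaled = tol_scaled, max_tolerance = max_tolerance))
```

Note that a model fit to a scaled response predicts on the scaled scale, so predictions would need to be back-transformed.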

Created on 2021-06-16 by the reprex package (v2.0.0)

Jun 16 '21 15:06 aosmith16