survHE memory usage tipping point

Hi Gianluca

I have run into a problem when trying to build a model and have recreated it below using the bc data augmented with some dummy variables.

Both model 7 and model 9 below fit very quickly with little memory overhead (as seen in Windows Task Manager). Model 8, however, which includes all the terms from 7 and 9, doesn't complete because it uses up my 4GB of spare working memory.

# load survHE and get some data (bc is the breast cancer dataset from flexsurv)
library(survHE)
data(bc)
N<-nrow(bc)

# create some dummy categorical variables
bc$x1<-round(2*runif(N))
bc$x2<-round(3*runif(N))

# create some 'continuous' variables
bc$x3<-round(10*runif(N))
bc$x4<-round(33*runif(N))

# create indicator variables for the levels of categorical variables to allow interactions
bc$x1.1<-as.factor(bc$x1==1)
bc$x1.2<-as.factor(bc$x1==2)
bc$x2.1<-as.factor(bc$x2==1)
bc$x2.2<-as.factor(bc$x2==2)
bc$x2.3<-as.factor(bc$x2==3)

# create interactions
bc$x1.x2<-as.factor(bc$x1*bc$x2)
# note: as.numeric() on a factor returns the level codes (1 for FALSE, 2 for TRUE), not 0/1
bc$x1.1.x3<-as.numeric(bc$x1.1)*bc$x3
bc$x1.2.x3<-as.numeric(bc$x1.2)*bc$x3

# modelling with interaction of categorical variables
form7<-with(bc, Surv(rectime, censrec)~group+x1.1+x1.2+x2.1+x2.2+x2.3+x1.x2+x3+x4)
m7<-fit.models(form7, data=bc, distr="rps", k=1)
print(m7)

# modelling with interactions of categorical variable with categorical and continuous variable
form8<-with(bc, Surv(rectime, censrec)~group+x1.1+x1.2+x2.1+x2.2+x2.3+x1.x2+x3+x4+x1.1.x3+x1.2.x3)
m8<-fit.models(form8, data=bc, distr="rps", k=1)
print(m8)

# modelling with interaction of categorical variable with continuous variable
form9<-with(bc, Surv(rectime, censrec)~group+x1.1+x1.2+x3+x1.1.x3+x1.2.x3)
m9<-fit.models(form9, data=bc, distr="rps", k=1)
print(m9)

Oct 17 '18 09:10 Geoff-Holmes

Hi Geoff, I don't think it's necessarily surprising... I personally find interactions with variables with multiple values rather complex to interpret and analyse anyway. So first of all, I wonder whether you could/should consider constructing specific interactions (eg re-group the variables as low/high and then have interactions to mean both at the low level, both at the high level and the two cross-terms)?

Did you try to see whether the problem is specific to RPS, or can you reproduce it with other distributions? And does it change much to use RPS with k=1, ie does a single knot improve the fit massively in comparison to the Weibull (which would be the reference distribution at k=0)?

Finally, you're kind of exploding the terms here: there are very many categories in your interactions! It may be an issue with memory, but equally there is not an awful lot of data to estimate this many parameters.
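One way to see how many parameters each formula asks for is to count the columns of its design matrix. The following is a minimal sketch in base R only, so it runs without survHE or flexsurv. The data frame `d`, the simulated `group` factor, the sample size and the seed are assumptions made for illustration; they are not the real bc data.

```r
# Stand-in data with the same constructed covariates as in the thread.
# 'group' is simulated (hypothetical) so that no survival package is needed.
set.seed(1)   # arbitrary seed, not from the original post
N <- 686      # assumed sample size, for illustration only
d <- data.frame(
  group = factor(sample(c("Good", "Medium", "Poor"), N, replace = TRUE)),
  x1 = round(2 * runif(N)),
  x2 = round(3 * runif(N)),
  x3 = round(10 * runif(N)),
  x4 = round(33 * runif(N))
)
d$x1.1 <- as.factor(d$x1 == 1)
d$x1.2 <- as.factor(d$x1 == 2)
d$x2.1 <- as.factor(d$x2 == 1)
d$x2.2 <- as.factor(d$x2 == 2)
d$x2.3 <- as.factor(d$x2 == 3)
d$x1.x2 <- as.factor(d$x1 * d$x2)
d$x1.1.x3 <- as.numeric(d$x1.1) * d$x3
d$x1.2.x3 <- as.numeric(d$x1.2) * d$x3

# Right-hand sides of the three formulas from the original post
rhs7 <- ~ group + x1.1 + x1.2 + x2.1 + x2.2 + x2.3 + x1.x2 + x3 + x4
rhs8 <- ~ group + x1.1 + x1.2 + x2.1 + x2.2 + x2.3 + x1.x2 + x3 + x4 + x1.1.x3 + x1.2.x3
rhs9 <- ~ group + x1.1 + x1.2 + x3 + x1.1.x3 + x1.2.x3

# Number of design-matrix columns (coefficients requested) and the matrix rank
design_summary <- function(rhs, data) {
  X <- model.matrix(rhs, data)
  c(columns = ncol(X), rank = qr(X)$rank)
}
summ <- sapply(list(m7 = rhs7, m8 = rhs8, m9 = rhs9), design_summary, data = d)
print(summ)
```

If the rank comes out lower than the column count, some terms are exactly collinear. The spline parameters of the RPS model come on top of these regression coefficients.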

Oct 18 '18 13:10 giabaio

Hi Gianluca, I tried a few other distributions (Weibull, genF) and also rps with k=0, and in all these cases it worked fine. The model also works fine with flexsurvspline with any number of knots (up to 5 anyway). With MLE estimation, is the model passed to flexsurv in any case? If so, it should presumably work okay.
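For reference, the k=0 versus k=1 comparison can be sketched directly in flexsurv along the following lines. This is an illustration under assumptions: it uses only the `group` covariate from bc rather than the full formulas above, and it calls flexsurv itself rather than survHE, so it says nothing about how fit.models handles the same model internally.

```r
library(flexsurv)   # provides flexsurvspline() and the bc data; attaches survival for Surv()
data(bc)

# Royston-Parmar spline models on the hazard scale; with k = 0 this is a Weibull model
fit_k0 <- flexsurvspline(Surv(rectime, censrec) ~ group, data = bc, k = 0, scale = "hazard")
fit_k1 <- flexsurvspline(Surv(rectime, censrec) ~ group, data = bc, k = 1, scale = "hazard")

# AIC is stored on the fitted object; lower values indicate a better trade-off
aic <- c(k0 = fit_k0$AIC, k1 = fit_k1$AIC)
print(aic)
```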

I found with survHE I had to split the interacting categorical covariates down into indicators to get it to work in the simpler cases. In the data I'm really interested in I have N=16,000, and I have found that the one internal knot makes a significant difference (to the AIC).
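To make clear what the hand-built indicators correspond to, here is a small base R sketch comparing a manual 0/1 indicator interaction with the column that R's formula interface would build for a factor-by-continuous interaction. The toy data frame `d` is hypothetical, and the sketch does not test whether fit.models accepts the `factor(x1) * x3` form.

```r
# Toy data; x1 and x3 mimic the dummy covariates used in the thread
d <- data.frame(x1 = rep(0:2, times = 20), x3 = rep(0:9, times = 6))

# Formula-built design matrix: main effects plus the factor-by-continuous interaction
X <- model.matrix(~ factor(x1) * x3, data = d)
print(colnames(X))

# Hand-built interaction for level 1 of x1, using a 0/1 indicator
# (as.numeric() on a logical gives 0/1, unlike as.numeric() on a factor)
manual_x1.1.x3 <- as.numeric(d$x1 == 1) * d$x3

# Check that the two constructions agree
same <- isTRUE(all.equal(unname(X[, "factor(x1)1:x3"]), manual_x1.1.x3))
print(same)
```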

Oct 24 '18 08:10 Geoff-Holmes
