survHE
memory usage tipping point
Hi Gianluca
I have run into a problem when trying to build a model and have recreated it below using the bc data augmented with some dummy variables.
Both model 7 and model 9 below fit very quickly with little memory overhead (judging by Windows Task Manager). Model 8, however, which includes all the terms from 7 and 9, does not complete because it eats up my 4GB of spare working memory.
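# load the packages: survHE for fit.models(), flexsurv for the bc data
library(survHE)
library(flexsurv)
# get some data
data(bc)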
N<-nrow(bc)
# create some dummy categorical variables
bc$x1<-round(2*runif(N))
bc$x2<-round(3*runif(N))
# create some 'continuous' variables
bc$x3<-round(10*runif(N))
bc$x4<-round(33*runif(N))
# create indicator variables for the levels of categorical variables to allow interactions
bc$x1.1<-as.factor(bc$x1==1)
bc$x1.2<-as.factor(bc$x1==2)
bc$x2.1<-as.factor(bc$x2==1)
bc$x2.2<-as.factor(bc$x2==2)
bc$x2.3<-as.factor(bc$x2==3)
# create interactions
bc$x1.x2<-as.factor(bc$x1*bc$x2)
# note: as.numeric() of a FALSE/TRUE factor gives codes 1/2, not 0/1
bc$x1.1.x3<-as.numeric(bc$x1.1)*bc$x3
bc$x1.2.x3<-as.numeric(bc$x1.2)*bc$x3
# modelling with interaction of categorical variables
form7<-with(bc, Surv(rectime, censrec)~group+x1.1+x1.2+x2.1+x2.2+x2.3+x1.x2+x3+x4)
m7<-fit.models(form7, data=bc, distr="rps", k=1)
print(m7)
# modelling with interactions of categorical variable with categorical and continuous variable
form8<-with(bc, Surv(rectime, censrec)~group+x1.1+x1.2+x2.1+x2.2+x2.3+x1.x2+x3+x4+x1.1.x3+x1.2.x3)
m8<-fit.models(form8, data=bc, distr="rps", k=1)
print(m8)
# modelling with interaction of categorical variable with continuous variable
form9<-with(bc, Surv(rectime, censrec)~group+x1.1+x1.2+x3+x1.1.x3+x1.2.x3)
m9<-fit.models(form9, data=bc, distr="rps", k=1)
print(m9)
Hi Geoff, I don't think it's necessarily surprising... I personally find interactions between variables with multiple values rather complex to interpret and analyse anyway. So first of all, I wonder whether you could/should consider constructing specific interactions (e.g. re-group the variables as low/high and then have interaction terms for both at the low level, both at the high level and the two cross-terms)?
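A minimal sketch of what such a regrouping could look like, on dummy data rather than the real covariates. The median cut point and the names `x3.grp`, `x4.grp` and `x3.x4` are illustrative assumptions, not something taken from survHE:

```r
# Sketch only: dummy data standing in for two multi-valued covariates
set.seed(1)
d <- data.frame(x3 = round(10 * runif(200)),
                x4 = round(33 * runif(200)))

# Dichotomise each variable at its median (the cut point is an assumption)
d$x3.grp <- factor(ifelse(d$x3 > median(d$x3), "high", "low"),
                   levels = c("low", "high"))
d$x4.grp <- factor(ifelse(d$x4 > median(d$x4), "high", "low"),
                   levels = c("low", "high"))

# One four-level factor: both low, both high and the two cross-terms
d$x3.x4 <- interaction(d$x3.grp, d$x4.grp)
levels(d$x3.x4)

# The design matrix then has an intercept plus three dummy columns
X <- model.matrix(~ x3.x4, data = d)
ncol(X)
```

The single factor `x3.x4` could then enter the model formula in place of the separate indicator and product terms, which keeps the number of interaction parameters fixed at three.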
Did you try to see whether the problem is specific to RPS, or can you reproduce it for other distributions? And does using RPS with k=1 change much, i.e. does a single knot improve the fit massively in comparison to the Weibull (which would be the reference distribution at k=0)?
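One way to check the k=0 versus k=1 question is to fit both directly in flexsurv and compare the AIC. This is only a sketch: it uses a deliberately simple formula with `group` alone, not the full set of covariates from the models above.

```r
library(flexsurv)  # also attaches survival, which provides Surv()
data(bc)

# Same formula with 0 and 1 internal knots; on the default hazard scale
# the k = 0 model is the Weibull
fit.k0 <- flexsurvspline(Surv(rectime, censrec) ~ group, data = bc, k = 0)
fit.k1 <- flexsurvspline(Surv(rectime, censrec) ~ group, data = bc, k = 1)

# Lower AIC indicates the better trade-off between fit and complexity
c(k0 = fit.k0$AIC, k1 = fit.k1$AIC)
```

If the two values are close, the extra knot is probably not buying much for this particular formula.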
Finally, you're kind of exploding the terms here: there are very many categories in your interactions! It may be an issue with memory, but equally there may not be an awful lot of data to estimate this many parameters.
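To get a feel for how many regression terms a formula like form8 implies, one could count the columns of its design matrix. The sketch below uses dummy data (N and the `group` labels are placeholders) and codes the indicators as numeric 0/1, which differs from the factor coding in the original post:

```r
# Sketch on dummy data: count the design-matrix columns implied by a
# formula like form8. N and the group labels are placeholders.
set.seed(42)
N <- 500
d <- data.frame(
  group = factor(sample(c("Good", "Medium", "Poor"), N, replace = TRUE)),
  x1 = round(2 * runif(N)),
  x2 = round(3 * runif(N)),
  x3 = round(10 * runif(N)),
  x4 = round(33 * runif(N))
)
# Numeric 0/1 indicators (an assumption; the post uses factors)
d$x1.1 <- as.numeric(d$x1 == 1)
d$x1.2 <- as.numeric(d$x1 == 2)
d$x2.1 <- as.numeric(d$x2 == 1)
d$x2.2 <- as.numeric(d$x2 == 2)
d$x2.3 <- as.numeric(d$x2 == 3)
d$x1.x2 <- as.factor(d$x1 * d$x2)
d$x1.1.x3 <- d$x1.1 * d$x3
d$x1.2.x3 <- d$x1.2 * d$x3

rhs <- ~ group + x1.1 + x1.2 + x2.1 + x2.2 + x2.3 + x1.x2 + x3 + x4 +
  x1.1.x3 + x1.2.x3
X <- model.matrix(rhs, data = d)

ncol(X)         # number of columns, including the intercept
qr(X)$rank      # a rank below ncol(X) would signal collinear terms
table(d$x1.x2)  # observations per level of the product factor
```

Sparse cells in the final table would suggest that some of the interaction coefficients are being estimated from very few observations.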
Hi Gianluca
I tried with a few other distributions (Weibull, genF) and also rps with k=0, and in all these cases it worked fine. The model also works fine with flexsurvspline with any number of knots (up to 5 anyway).
In the MLE case, is the estimation passed to flexsurv anyway? If so, it should presumably work okay.
I found with survHE I had to split the interacting categorical covariates down into indicators to get it to work in the simpler cases.
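For anyone else hitting this, one way such a split might be done is with `model.matrix()`. This is a sketch on toy data; the column names `x2.1`, `x2.2`, `x2.3` and `x2.1.x3` simply mirror the naming used earlier in this thread:

```r
# Sketch on toy data: expand a categorical covariate into 0/1 indicators
d <- data.frame(x2 = factor(c(0, 1, 2, 3, 1, 2)),
                x3 = c(4, 7, 1, 9, 3, 5))

# model.matrix() builds one dummy column per non-reference level;
# the first column (the intercept) is dropped
ind <- model.matrix(~ x2, data = d)[, -1, drop = FALSE]
colnames(ind) <- paste0("x2.", levels(d$x2)[-1])
d <- cbind(d, ind)

# Interaction of one indicator with a continuous covariate
d$x2.1.x3 <- d$x2.1 * d$x3
d
```

Because the indicators are numeric 0/1 rather than factors, multiplying by a continuous variable gives zero outside the level of interest.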
In the data I'm really interested in I have N=16,000, and I have found that the one internal knot makes a significant difference (to the AIC).