
validate() for rms::ols: Error in lsfit(x, y) : only 0 cases, but 2 variables

Open • Deleetdk opened this issue Dec 14 '17 • 5 comments

I get a strange-sounding error when trying to use validate() on a fitted ols:

Error in lsfit(x, y) : only 0 cases, but 2 variables

The dataset has n = 1890, with about 400 predictors in the model. Almost all of the predictors are dichotomous dummies indicating whether some regex pattern matched a name or not, and some of them have only a few TRUE cases (though at least 10). This is a preliminary fit; the penalization to improve the model and pick the final predictors is done afterwards with LASSO in glmnet. However, I wanted to validate the initial model first. My guess is that the error occurs because the resampling ends up with no positive cases for some variable in the training set, which makes the fit fail, or makes that variable unusable for prediction in the test set.
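
One quick way to check whether that guess is at least plausible (an illustrative simulation only, with dummies about as rare as mine, not the real data):

# simulate 400 dummies with 10 TRUE cases each out of n = 1890 and count how
# often a bootstrap resample leaves at least one of them with no TRUE cases
set.seed(1)
n = 1890
X = replicate(400, sample(c(rep(TRUE, 10), rep(FALSE, n - 10))))
degenerate = replicate(200, {
  idx = sample(n, replace = TRUE)   # one bootstrap resample
  any(colSums(X[idx, ]) == 0)       # some dummy constant in the resample?
})
sum(degenerate)                     # resamples (of 200) with a degenerate column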

For a reproducible example, here's a similar dataset based on iris:

library(rms)  # so that validate() dispatches to validate.ols

# simulate some data: three random letter factors added to iris
iris2 = iris
set.seed(1)
iris2$letter1 = sample(letters, size = 150, replace = T)
iris2$letter2 = sample(letters, size = 150, replace = T)
iris2$letter3 = sample(letters, size = 150, replace = T)

# fit and validate
(fit = rms::ols(Sepal.Width ~ letter1 + letter2 + letter3 + Petal.Width + Petal.Length, data = iris2, x = T, y = T))
validate(fit)

Gives:

Error in lsfit(x, y) : only 0 cases, but 2 variables
In addition: Warning message:
In lsfit(x, y) : 150 missing values deleted

The dataset has no missing data.

In my own simple cross-validation implementation, I got around this issue by simply ignoring runs that produce errors; it is discussed in this Cross Validated question: https://stats.stackexchange.com/questions/213837/k-fold-cross-validation-nominal-predictor-level-appears-in-the-test-data-but-no Maybe rms should do the same?
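
Roughly, the workaround looks like this (a minimal sketch of the idea rather than my exact code; cv_r2 is just an illustrative helper):

# k-fold CV that simply skips folds where fitting or predicting errors out,
# e.g. because a factor level in the test fold never appeared in training
cv_r2 = function(formula, data, k = 10) {
  folds = sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    tryCatch({
      fit = lm(formula, data = data[folds != i, ])
      test = data[folds == i, ]
      cor(model.response(model.frame(formula, test)), predict(fit, newdata = test))^2
    }, error = function(e) NA_real_)   # failed fold -> NA, dropped later
  })
}

# e.g. mean(cv_r2(Sepal.Width ~ letter1 + letter2 + letter3 + Petal.Width + Petal.Length, iris2), na.rm = TRUE)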

Deleetdk avatar Dec 14 '17 07:12 Deleetdk

Thanks for the report. There was a bug in validate and calibrate for ols where singular fits were reported as NAs instead of setting fail=TRUE so that the sample would be ignored. This is fixed for the next release.

harrelfe avatar Dec 15 '17 02:12 harrelfe

After updating to the GitHub version, validate no longer throws an error, but it gives useless output for my use case because all 40 runs failed:

> validate(ols_fit)

Divergence or singularity in 40 samples
          index.orig training test optimism index.corrected n
R-square       0.572      NaN  NaN      NaN             NaN 0
MSE            0.425      NaN  NaN      NaN             NaN 0
g              0.000      NaN  NaN      NaN             NaN 0
Intercept      0.000      NaN  NaN      NaN             NaN 0
Slope          1.000      NaN  NaN      NaN             NaN 0

In the iris example it is also nearly useless: of the 40 runs, only 2 completed:

> validate(fit)

Divergence or singularity in 38 samples
          index.orig training   test optimism index.corrected n
R-square      0.5504   0.8728 -0.931    1.804         -1.2536 2
MSE           0.0848   0.0234  0.364   -0.341          0.4258 2
g             0.3504   0.4573  0.191    0.266          0.0845 2
Intercept     0.0000   0.0000  2.177   -2.177          2.1766 2
Slope         1.0000   1.0000  0.289    0.711          0.2886 2

My guess is the same as before: one has to use special sampling to avoid the issue. As someone on Cross Validated suggested:

You could look into stratified sampling, i.e. constraining your train/test splits so that they have (approximately) the same relative frequencies for your predictor levels.

However, I think it is worth considering whether the current behavior is actually wanted: random splitting will, with non-negligible frequency, result in sets that do not cover all predictor levels. Can you consider such a set representative for whatever the application is? I've been working with such small sample sizes and went for stratified splitting. But I insist that thinking hard about the data and the consequences of working with such small samples is at least as necessary as fixing the purely computational error.
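
For what it's worth, a stratified train/test split along those lines might look roughly like this (my sketch only, stratifying on a single dichotomous predictor; extending it to hundreds of rare dummies at once is exactly the hard part):

# split indices so that both halves keep roughly the same proportion of TRUEs
# for one dichotomous predictor
strat_split = function(flag, prop_train = 0.7) {
  train = unlist(lapply(split(seq_along(flag), flag), function(i)
    i[sample.int(length(i), round(length(i) * prop_train))]))
  list(train = sort(train), test = setdiff(seq_along(flag), train))
}

# e.g. s = strat_split(iris2$letter1 == "a"); fit on iris2[s$train, ], test on iris2[s$test, ]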

Deleetdk avatar Dec 16 '17 22:12 Deleetdk

The behavior you saw is the intended behavior when the sample size does not support a large number of parameters. You'll need to reduce the number of parameters in the model.

harrelfe avatar Dec 16 '17 22:12 harrelfe

How do you recommend that I validate models that contain a large number of logical predictors without running into this issue?

Deleetdk avatar Jan 09 '18 03:01 Deleetdk

You have too many parameters in the model.

harrelfe avatar Jan 09 '18 04:01 harrelfe