rmsb icon indicating copy to clipboard operation
rmsb copied to clipboard

Estimation of negative probabilities with predict

Open albertocarm opened this issue 5 years ago • 5 comments

I send a reproducible example showing negative probabilities. This occurs when one of the 'y' levels is rare, and the predictor has an extreme value in its range, so I have been slow to notice the problem.

set.seed(836)
data<- data.frame(HE6=sample(1:10, 200, replace = TRUE, prob=c(rep(0.1,6),0.01,0.002,0.19,0.198) ), 
                  Age = sample(1:85, 200, replace = TRUE), EORTC = sample(1:100, 200, replace = TRUE), 
                  linf=rbinom(200, 1,.5),
                  cir=rbinom(200, 1,.5),esquema=rbinom(200, 1,.5), riesgo=factor(rbinom(200, 2,.5)), estadio=factor(rbinom(200, 2,.5)))
head(data)
table(data$HE6)
dd<-datadist(data)
options(datadist='dd')

bsx <- blrm(HE6  ~ cir*rcs(Age,3)+ linf+ pol(EORTC)+esquema+estadio+riesgo, 
            ~ rcs(Age,3)+ pol(EORTC), cppo=function(y) y, data=data) 

newdata <- data.frame(cir=0, Age=85, EORTC= 10, linf=0, riesgo=0, esquema=1, estadio=1)
predict(bsx, newdata, type='fitted.ind') #

albertocarm avatar Oct 22 '20 13:10 albertocarm

Thanks for the reproducible example. I tried

for(s in 4 : 10) {
  cat('s:', s, '\n')
f <- blrm(HE6  ~ cir*rcs(Age,3)+ linf+ pol(EORTC)+esquema+estadio+riesgo, 
            ~ rcs(Age,3)+ pol(EORTC), cppo=function(y) y, data=data, seed=s) 

  print(predict(f, newdata, type='fitted.ind'))
}

and could not get any negative point estimates although I saw a few lower uncertainty intervals that were slightly negative.

harrelfe avatar Oct 22 '20 19:10 harrelfe

That is because it is a subtle issue, difficult to reproduce. It only seems to occur if some levels of the dependent variable are rare, and use extreme values for continuous predictors. To reproduce the error please use set.seed(836) to obtain exactly the same dataset. Notice the low frequency for y= 7 and y = 8. With my exact code you get this output, with an estimated negative probability for these same levels:

> set.seed(836)
> data<- data.frame(HE6=sample(1:10, 200, replace = TRUE, prob=c(rep(0.1,6),0.01,0.002,0.19,0.198) ), 
+                   Age = sample(1:85, 200, replace = TRUE), EORTC = sample(1:100, 200, replace = TRUE), 
+                   linf=rbinom(200, 1,.5),
+                   cir=rbinom(200, 1,.5),esquema=rbinom(200, 1,.5), riesgo=factor(rbinom(200, 2,.5)), estadio=factor(rbinom(200, 2,.5)))
> head(data)
  HE6 Age EORTC linf cir esquema riesgo estadio
1   6  85    42    0   0       1      1       0
2   5  49    49    1   0       0      1       2
3   6  31    58    1   0       0      1       2
4   6  73    81    1   1       1      0       0
5  10  81     6    1   0       0      1       1
6   9  28    28    0   0       0      0       1
> table(data$HE6)

 1  2  3  4  5  6  7  8  9 10 
18 16 22 13 27 22  3  1 46 32 
> dd<-datadist(data)
> options(datadist='dd')
> 
> bsx <- blrm(HE6  ~ cir*rcs(Age,3)+ linf+ pol(EORTC)+esquema+estadio+riesgo, 
+             ~ rcs(Age,3)+ pol(EORTC), cppo=function(y) y, data=data) 
Warning messages:
1: There were 920 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. 
2: Examine the pairs() plot to diagnose sampling problems
 
3: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.
 
> #plot(summary (bsx, ycut=5))
> #plot(summary(bsx)) 
> 
> newdata <- data.frame(cir=0, Age=85, EORTC= 10, linf=0, riesgo=0, esquema=1, estadio=1) # aquí está la interfaz con .net
> 
> predict(bsx, newdata, type='fitted.ind') # 
        y x   Mean  Lower Upper
1   HE6=1 1  0.078  0.006 0.189
2   HE6=2 1  0.064  0.009 0.132
3   HE6=3 1  0.088  0.025 0.156
4   HE6=4 1  0.058  0.011 0.107
5   HE6=5 1  0.119  0.048 0.180
6   HE6=6 1  0.101  0.045 0.159
7   HE6=7 1 -0.006 -0.044 0.033
8   HE6=8 1 -0.010 -0.050 0.028
9   HE6=9 1  0.262  0.154 0.366
10 HE6=10 1  0.245  0.052 0.469


albertocarm avatar Oct 22 '20 20:10 albertocarm

Therefore, when I try to plot Pr (y>= j) I get a zigzag instead of a downward curve

> predict(bsx, newdata, type='fitted') # 
      y x  Mean Lower Upper
1  y>=2 1 0.922 0.811 0.994
2  y>=3 1 0.858 0.690 0.986
3  y>=4 1 0.770 0.549 0.948
4  y>=5 1 0.712 0.476 0.918
5  y>=6 1 0.593 0.318 0.833
6  y>=7 1 0.492 0.227 0.752
7  y>=8 1 0.497 0.221 0.753
8  y>=9 1 0.507 0.241 0.785
9 y>=10 1 0.245 0.052 0.469

albertocarm avatar Oct 22 '20 20:10 albertocarm

Since ordinality is relaxed with respect to two of the variables, it may be reasonable to expect this to occur (as long as values are just slightly negative) for a few posterior sampling paths. We may want to override negatives with zeros. It would be interested to learn if this ever happens for large sample sizes.

harrelfe avatar Oct 22 '20 21:10 harrelfe

I have repeated it with 2000 simulated observations. The problem is minor, but it is still happening. Then, for the probability Pr (y >= j) you suggest adding up all the individual probabilities up to j, changing the negatives to zero, but he sum of point probabilities will no longer be 1. And wouldn't a positive prior be better?

albertocarm avatar Oct 22 '20 21:10 albertocarm