markovchain icon indicating copy to clipboard operation
markovchain copied to clipboard

DoF computing mistake in function verifyHomogeneity

Open drobelin opened this issue 6 years ago • 3 comments

Hi, Seems there is a mistake in the calculation of the number of degree of freedom of the "verifyHomogeneity" test.

Running the example shows 35 DoF instead of 30 following the documentation ("an_introduction_to_markovchain_package.pdf, p.35), as the considered Markov chains have 6 states.

>  verifyHomogeneity(kullback)
 Testing homogeneity of DTMC underlying input list 
 ChiSq statistic is 275.9963 d.o.f are 35 corresponding p-value is 0 
 $statistic
 [1] 275.9963
 
 $dof
 [1] 35
 
 $pvalue
 [1] 0

drobelin avatar Nov 13 '17 15:11 drobelin

May I open a new ticket, but it seems there is a bug in the calculation of "verifyHomogeneity" test. I made some simulations in order to evaluate the distribution of the test-statistic under null hypothesis. Below, I will find R code which compute by simulation the empirical distribution of the statistics given by verifyHomogeneity test, and a log-likelihood ratio. I think that the formula p.35 of "an_introduction_to_markovchain_package.pdf" needs a correction too.

require(markovchain)

nbsimu = 10000               # Nb of simulation
statesNames=c("A","B")  # 2 states : A , B
n=c(100,200)                  # Respective length of both simulated markov chains
transitionMatrix1 = matrix(c(0.7,0.3,0.1,0.9),byrow=TRUE, nrow=2, dimnames=list(statesNames,statesNames))
transitionMatrix2 = matrix(c(0.7,0.3,0.1,0.9),byrow=TRUE, nrow=2, dimnames=list(statesNames,statesNames))

res = matrix(nrow = nbsimu, ncol = 5)   #To store results of each iteration
for(i in 1:nbsimu) {
  M1 = new("markovchain", transitionMatrix=transitionMatrix1) 
  M2 = new("markovchain", transitionMatrix=transitionMatrix2) 
  m1 = markovchainSequence(n = n[1], markovchain = M1)
  m2 = markovchainSequence(n = n[2], markovchain = M2)
  m = c(m1,NA,m2)  #useful to compute model under null hypothesis
  m1.cpt = unclass(table(m1,c(m1[-1],NA))) #class "matrix"
  m2.cpt = unclass(table(m2,c(m2[-1],NA))) #class "matrix"
  m1f = markovchainFit(m1,confint=FALSE)
  m2f = markovchainFit(m2,confint=FALSE)
  mf = markovchainFit(m)
  vh = verifyHomogeneity(list(m1.cpt,m2.cpt), verbose = FALSE)
  res[i,] = c(m1f$logLikelihood, m2f$logLikelihood, mf$logLikelihood, vh$statistic, vh$pvalue)
}
res = as.data.frame(res)
names(res) <- c("L1", "L2", "L","vhStat","vhPvalue")
# Correcting DoF following above comment (2 instead of 3)
res$vhPvalueCorr <- pchisq(res$vhStat, df = 2, lower.tail = FALSE)
#Log-likelihood ratio statistic
res$F = 2*((res$L1+res$L2)-res$L)
res$pvalue = pchisq(res$F,df = 2,lower.tail = FALSE)

# Under null hypothesis, p-value should follow a uniform(0,1) distribution
plot(hist(res$vhPvalue))
plot(hist(res$vhPvalueCorr))
plot(hist(res$pvalue))
plot(density(res$F))
plot(density(res$vhStat))

ks.test(res$F,"pchisq",df=2)
ks.test(res$vhStat,"pchisq",df=3)
ks.test(res$vhStat,"pchisq",df=3)
mean(res$F) ; var(res$F)
mean(res$vhStat) ; var(res$vhStat)

drobelin avatar Nov 23 '17 09:11 drobelin

Hi @drobelin first of all I wish to thank you for the issue. I am quite busy at the moment, but I will check if possible. A possible quick solution is for you to suggest a solution and to post a revised version of the function either here or by github pull request

Il gio 23 nov 2017, 10:14 drobelin [email protected] ha scritto:

May I open a new ticket, but it seems there is a bug in the calculation of "verifyHomogeneity" test. I made some simulations in order to evaluate the distribution of the test-statistic under null hypothesis. Below, I will find R code which compute by simulation the empirical distribution of the statistics given by verifyHomogeneity test, and a log-likelihood ratio. I think that the formula p.35 of "an_introduction_to_markovchain_package.pdf" needs a correction too.

require(markovchain)

nbsimu = 10000 # Nb of simulation statesNames=c("A","B") # 2 states : A , B n=c(100,200) # Respective length of both simulated markov chains transitionMatrix1 = matrix(c(0.7,0.3,0.1,0.9),byrow=TRUE, nrow=2, dimnames=list(statesNames,statesNames)) transitionMatrix2 = matrix(c(0.7,0.3,0.1,0.9),byrow=TRUE, nrow=2, dimnames=list(statesNames,statesNames))

res = matrix(nrow = nbsimu, ncol = 5) #To store results of each iteration for(i in 1:nbsimu) { M1 = new("markovchain", transitionMatrix=transitionMatrix1) M2 = new("markovchain", transitionMatrix=transitionMatrix2) m1 = markovchainSequence(n = n[1], markovchain = M1) m2 = markovchainSequence(n = n[2], markovchain = M2) m = c(m1,NA,m2) #useful to compute model under null hypothesis m1.cpt = unclass(table(m1,c(m1[-1],NA))) #class "matrix" m2.cpt = unclass(table(m2,c(m2[-1],NA))) #class "matrix" m1f = markovchainFit(m1,confint=FALSE) m2f = markovchainFit(m2,confint=FALSE) mf = markovchainFit(m) vh = verifyHomogeneity(list(m1.cpt,m2.cpt), verbose = FALSE) res[i,] = c(m1f$logLikelihood, m2f$logLikelihood, mf$logLikelihood, vh$statistic, vh$pvalue) } res = as.data.frame(res) names(res) <- c("L1", "L2", "L","vhStat","vhPvalue")

Correcting DoF following above comment (2 instead of 3)

res$vhPvalueCorr <- pchisq(res$vhStat, df = 2, lower.tail = FALSE) #Log-likelihood ratio statistic res$F = 2*((res$L1+res$L2)-res$L) res$pvalue = pchisq(res$F,df = 2,lower.tail = FALSE)

Under null hypothesis, p-value should follow a uniform(0,1) distribution

plot(hist(res$vhPvalue)) plot(hist(res$vhPvalueCorr)) plot(hist(res$pvalue)) plot(density(res$F)) plot(density(res$vhStat))

ks.test(res$F,"pchisq",df=2) ks.test(res$vhStat,"pchisq",df=3) ks.test(res$vhStat,"pchisq",df=3) mean(res$F) ; var(res$F) mean(res$vhStat) ; var(res$vhStat)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/spedygiorgio/markovchain/issues/142#issuecomment-346564255, or mute the thread https://github.com/notifications/unsubscribe-auth/AAyWxrQXNtlf7Rr7TmzgFUhKLrK1j_dmks5s5Td7gaJpZM4Qb73Y .

spedygiorgio avatar Nov 23 '17 10:11 spedygiorgio

So far I have seen, the degrees of freedom are calculated folowing the formula written in the "Kullbak et al 1962" publication (see table 9.1). However, I do not have enough knowlegde to judge if this is the proper way of calculating the dof´s.

xabiben avatar Oct 17 '18 08:10 xabiben