Turing.jl
Improper (?) chains resuming
Hi to everybody, I have been playing with Turing and I am quite satisfied. It's a very good library, thank you!
I am trying to use my university's cluster with Turing, and it looks like it is working. What bothers me is how resuming chains behaves. This is the code I am using:
using Turing
using Distributed
using ClusterManagers
using Dates
n_process = 30
# add n_process workers on the long queue, one at a time
for i in 1:n_process
    time = string(Dates.now())
    ClusterManagers.addprocs_lsf(1; bsub_flags = `-q long -o Turing/outfiles/$time.out -e Turing/outfiles/$time.err -n 2 `)
end
# load Turing on all processes
@everywhere using Turing
# toy model found in the Turing documentation
@everywhere @model function gdemo(x)
    s² ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s²))
    for i in eachindex(x)
        x[i] ~ Normal(m, sqrt(s²))
    end
end
@everywhere model = gdemo([1.5, 2.0])
#sampling
chains = sample(model, MH(), MCMCDistributed(), 40, n_process; save_state = true)
write("Turing/chains/first_chain-file.jls", chains)
#resume sampling
chainsb = sample(model, MH(), MCMCDistributed(), 40, n_process; save_state = true, resume_from = chains)
write("Turing/chains/resumed_chain-file.jls", chainsb)
Everything appears to work without problems, but then I decided to check the last step of the chains in chains and the first step of the chains in chainsb, obtaining respectively
-2.5569926638379856
0.6013349624568101
0.8590870562760059
2.395092327070279
0.30633825992172603
1.8721031535336894
0.3845124887448378
1.0331373520741012
0.3914729143619578
1.1355888789926338
0.4570932249605562
1.9812393973041307
-0.07421084058792975
0.4170399419102264
1.5349445943181428
0.49824038196769993
2.0694996617266224
1.8940648949354408
-0.01762467471248246
3.772357305507156
1.3129433939104416
0.7417751809903653
1.110839325040691
2.846731924304791
0.6912786220073702
1.8708838415407694
1.982571326551052
1.3601574483697967
1.3642559267226588
1.044768883206148
0.13689937832763854
-0.7359908349472912
-0.713229220500176
1.0786609524455146
0.6055944537774374
0.31771407030069704
-2.0896892538535012
-2.5569926638379856
-0.9688255168942798
-2.5569926638379856
-0.19465181475999574
0.25116986501880706
0.2912230673438316
-0.870484352037458
-0.7385675043481791
-0.12438506493203039
-0.7973532014797547
1.2906733092061466
0.8261938949153473
-0.8878086665415579
-0.23865719371572905
-1.016826590909152
2.567799727834964
1.998903315214728
0.8754360818675754
0.1421659708848039
0.924392672495571
-2.5569926638379856
-2.5569926638379856
0.4415929537755887
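One way to read the two lists is to compare them position by position. The snippet below is only a sketch: it assumes the first 30 values above are the final draws of `chains` and the remaining 30 are the first draws of `chainsb`, in the same chain order, and it copies the entries for chains 1, 8, and 10 only. The variable names are made up for illustration.

```julia
# Values copied from the output above (chains 1, 8, and 10 only).
# Assumption: first 30 printed values = last step of `chains`,
# next 30 printed values = first step of `chainsb`.
last_of_first_run = [-2.5569926638379856, 1.0331373520741012, 1.1355888789926338]
first_of_resumed  = [0.13689937832763854, -2.5569926638379856, -2.5569926638379856]

# Positions (within this excerpt) whose first resumed draw equals the last draw of chain 1.
from_chain_one = findall(==(last_of_first_run[1]), first_of_resumed)

# Positions whose first resumed draw equals the last draw of the same chain.
from_own_chain = findall(first_of_resumed .== last_of_first_run)

println("match chain 1's last draw: ", from_chain_one)
println("match own last draw:       ", from_own_chain)
```

Since MH keeps the current value whenever a proposal is rejected, a resumed chain whose first draw repeats a value from the previous run most likely started from that value.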
It looks like the last step of the first chain (-2.5569926638379856) is used to restart ALL the chains. However, I'd expect the i-th chain to restart from the last step of the i-th chain. Is this the expected behaviour, or is there a mistake? Am I missing something? Thank you, Marco
Yeah, this is an error. I'm working on something elsewhere to make the chain resume functionality way better.
Thank you for your answer. If you need help with docs or testing, please feel free to send me a message.
Likely fixed by recent updates to MCMCChains.