Improper (?) chains resuming

Open marcobonici opened this issue 3 years ago • 2 comments

Hi everybody, I have been playing with Turing and I am quite happy with it. It's a very good library, thank you!

I am trying to use my university's cluster with Turing and it seems to be working. What bothers me is the possibility of resuming chains. This is the code I am using:

using Turing
using Distributed
using ClusterManagers
using Dates

n_process = 30
#add n_process workers, one at a time, on the long queue
for i in 1:n_process
    time = string(Dates.now())
    ClusterManagers.addprocs_lsf(1; bsub_flags = `-q long -o Turing/outfiles/$time.out -e Turing/outfiles/$time.err -n 2 `)
end

#loading Turing on the processes
@everywhere using Turing

#toy model found in the Turing documentation
@everywhere @model function gdemo(x)
    s² ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s²))

    for i in eachindex(x)
        x[i] ~ Normal(m, sqrt(s²))
    end
end

@everywhere model = gdemo([1.5, 2.0])
#sampling
chains = sample(model, MH(), MCMCDistributed(), 40, n_process; save_state = true)
write("Turing/chains/first_chain-file.jls", chains)

#resume sampling
chainsb = sample(model, MH(), MCMCDistributed(), 40, n_process; save_state = true, resume_from = chains)
write("Turing/chains/resumed_chain-file.jls", chainsb)

Everything seems to work without problems, but then I decided to compare the last step of the chains in chains with the first step of the chains in chainsb, obtaining respectively:

 -2.5569926638379856
  0.6013349624568101
  0.8590870562760059
  2.395092327070279
  0.30633825992172603
  1.8721031535336894
  0.3845124887448378
  1.0331373520741012
  0.3914729143619578
  1.1355888789926338
  0.4570932249605562
  1.9812393973041307
 -0.07421084058792975
  0.4170399419102264
  1.5349445943181428
  0.49824038196769993
  2.0694996617266224
  1.8940648949354408
 -0.01762467471248246
  3.772357305507156
  1.3129433939104416
  0.7417751809903653
  1.110839325040691
  2.846731924304791
  0.6912786220073702
  1.8708838415407694
  1.982571326551052
  1.3601574483697967
  1.3642559267226588
  1.044768883206148

  0.13689937832763854
 -0.7359908349472912
 -0.713229220500176
  1.0786609524455146
  0.6055944537774374
  0.31771407030069704
 -2.0896892538535012
 -2.5569926638379856
 -0.9688255168942798
 -2.5569926638379856
 -0.19465181475999574
  0.25116986501880706
  0.2912230673438316
 -0.870484352037458
 -0.7385675043481791
 -0.12438506493203039
 -0.7973532014797547
  1.2906733092061466
  0.8261938949153473
 -0.8878086665415579
 -0.23865719371572905
 -1.016826590909152
  2.567799727834964
  1.998903315214728
  0.8754360818675754
  0.1421659708848039
  0.924392672495571
 -2.5569926638379856
 -2.5569926638379856
  0.4415929537755887

It looks like the last step of the first chain (-2.5569926638379856) is used to restart ALL the chains. However, I'd expect the i-th resumed chain to restart from the last step of the i-th original chain. Is this the expected behaviour or is there a mistake? Am I missing something? Thank you, Marco
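The comparison described above can be sketched in plain Julia. This is only an illustration: `resume_matches` is a hypothetical helper, not part of Turing or MCMCChains, and the two vectors hold just the first ten values from each list above. Because a rejected Metropolis-Hastings proposal leaves a chain at its starting state, a resumed chain whose first step repeats an old value probably started from that value; an accepted proposal hides the starting point, so missing matches prove nothing.

```julia
# Hypothetical helper (not a Turing or MCMCChains function).
# last_vals[i]  : last sample of chain i in the original run
# first_vals[i] : first sample of chain i in the resumed run
function resume_matches(last_vals::AbstractVector, first_vals::AbstractVector)
    length(last_vals) == length(first_vals) ||
        throw(ArgumentError("need one value per chain in both vectors"))
    # Resumed chains whose first step repeats their own previous last step
    own = findall(first_vals .== last_vals)
    # Resumed chains whose first step repeats the last step of chain 1
    from_first = findall(==(last_vals[1]), first_vals)
    return (own = own, from_first = from_first)
end

# First ten values of each list posted above
last_vals = [-2.5569926638379856, 0.6013349624568101, 0.8590870562760059,
             2.395092327070279, 0.30633825992172603, 1.8721031535336894,
             0.3845124887448378, 1.0331373520741012, 0.3914729143619578,
             1.1355888789926338]
first_vals = [0.13689937832763854, -0.7359908349472912, -0.713229220500176,
              1.0786609524455146, 0.6055944537774374, 0.31771407030069704,
              -2.0896892538535012, -2.5569926638379856, -0.9688255168942798,
              -2.5569926638379856]

result = resume_matches(last_vals, first_vals)
println("repeat their own last step:     ", result.own)
println("repeat the last step of chain 1: ", result.from_first)
```

On this subset no resumed chain repeats its own last value, while resumed chains 8 and 10 repeat the last value of chain 1, which is consistent with every chain having been restarted from the first chain's state.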

marcobonici avatar Aug 30 '21 00:08 marcobonici

Yeah, this is an error. I'm working on something elsewhere to make the chain resume functionality way better.

cpfiffer avatar Aug 30 '21 18:08 cpfiffer

Thank you for your answer. If you need help with docs or testing, please feel free to send me a message.

marcobonici avatar Aug 30 '21 18:08 marcobonici

Likely fixed by recent updates to MCMCChains.

yebai avatar Nov 12 '22 20:11 yebai