PTMCMCSampler
PTMCMCSampler copied to clipboard
Resuming a calculation with hot chains (-nc)
Because
- Hot chains must always be loaded for resuming to resume correctly,
-
writeHotChains=False
by default, and it is not set as a keyword argument insetup_sampler()
inenterprise extensions
,
I suggest resetting the default argument for now, to avoid problems with the results.
Tagging @vhaasteren , @astrolamb
How about leaving the default to False (because they take up a lot of disk space), and instead throwing an error if resume=True and we are sampling with resume=False?
The only issue with that I see is that it could be that the runs were initially done with resume=False (with writeHotChains=False), and then you cannot abort/resume them.
But I think it's probably not fun for people who now suddenly have jobs that use up a lot of memory. Perhaps we can also set a more useful error message than the one that sparked this adjustment?
I agree with @vhaasteren that just changing the default writeHotChains
is not the right answer. You could not care about resuming and not want to use up lots of disk space. On the other hand, I don't think it makes sense to resume a parallel-tempering run with no data about where to start the hot chains. William Lamb's suggestion to save just enough data to resume is a good one, but would take some implementation effort. Meanwhile I suggest the following minimal changes:
- Apparently it now writes empty hot chain files when
writeHotChains=False
. Change it to not write anything. - Refuse to resume a parallel-tempering run unless we have all the old chains. Give a sensible error message.
- Fix enterprise_extensions to pass on
writeHotChains
to PTMCMCSampler.
I'm not sure about interpreting resume
as "intent to resume later" and so signaling an error when resume=True
but writeHotChains=False
. You might want this. There might be a default resume=True
when you're not intending to resume, or you might be resuming for the last time and not need to write the hot chains further. I would be more inclined to just print a warning in this case, as we do now when Niter
is not a multiple of thin
so that samples will be lost. We could also print a warning when Niter
is not a multiple of isave
, which leads to a partial block in the chain file and then you cannot resume.
Excellent suggestions @kdolum, I agree with all of it. I like the warnings you suggest
Thanks, @vhaasteren and @kdolum , this sounds good. I was initially thinking in terms of 4 temperatures, but if people use >10 saving extra chains by default might indeed cause problems. In case we are avoiding drastic changes, I have one more suggestion. IIn particular, to show a warning instead of an error when attempting to resume the run without the old chains. Perhaps it can also avoid practical issues for users. I incorporated it in my latest commit. What do you think?
I think it should be an error if you try to resume a parallel-tempering run with no hot chain information. What are we supposed to use as the initial samples for the hot chains? I guess it currently uses the initial sample given to the sampler (which I imagine is otherwise ignored on resuming). That seems unfortunate. What is the purpose of resuming in this case, instead of just doing a new run to gather more samples? But if this is useful, we could put in some with additional variable to suppress the error. I just don't think people should get this behavior without an explicit request. I suppose maybe the last sample from the cold chain is a better starting place for the hot chains than a random sample if one does this at all.
Adding a "special request" variable, at least for my case, will not make a difference for me because it's not in enterprise_extensions
anyway. So, perhaps I can just raise an exception instead...
I like the extra variable that suppresses the error that @kdolum suggests. And, we can totally add it to enterprise_extensions. I'd approve such a PR
It would have been neater to not write the hot chain files at all, but the sampler already has this behavior. Let's not change it unless we have to. Perhaps add one more check for self.resume
in the current PR changes to differentiate between error and warning? It makes no sense to resume unless suppressed by this type of variable. Just make the error message usefule
Hi @vhaasteren , @kdolum , I got distracted to other work, but now I introduced this new variable that we discussed. Shall we close the PR now?
I'm confused by the logic here. It looks like what you did is that if the old chain file is the wrong length but ignoreHotChainsWhenResuming is set, and this is not the cold chain, then we ignore the error and start the chain file anew. But it looks like it will still print something about how it is resuming.
- Do we want the flag to always ignore old hot chains, or only ignore them if they are invalid? If the latter, it should have a different name like "allowMissingHotChainsWhenResuming". In any case, I think the function argument should have the full name, not an abbreviation. I would also advocate not ignoring the chain file if it is bogus, but only if the previous run didn't set writeHotChains. I guess that leads to a zero length chain, so only in that specific case do I think we should go on. Otherwise something fishy could be happening and we don't get notified of it.
- I think when it doesn't use the old chain it shouldn't say it is resuming.
- There may be issues with sample counting. If some chains resumed but others didn't, the numbering of samples will be different in different processes. I think if you don't set writeHotChains, then you resume and this time you do set writeHotChains, your hot chains will be shorter than your cold chain. Maybe it doesn't matter because all counting and reporting is done by rank 0, but it might be good to think about whether there will be problems.
- Perhaps it would be best to start the hot chains with the last sample from the cold chain if you don't have a hot chain to resume from. Otherwise there is some time of burning in the hot chains during which time parallel tempering won't help you. I wouldn't use this whole feature myself anyway, so I don't have strong feelings.
@kdolum , I can rename the variable, but the discussion of (1, 3) resuming with zero/non-zero length cold chains, (4) selecting a sample, -- still all boils down to the same case of starting the burn in in hot chains again. So, this seems as micro-managing something that is not important, and/or discriminating between incorrect use cases. One either cares about hot chains and resumes correctly, or not, so it should not make any difference to users.
@vhaasteren , what do you think?