Saving the trace
I love the progress with PyMC4 so far. Are there plans to implement a save_trace
function in PyMC4?
I have been trying the following:
import pymc4 as pm
...
model = example_model()
trace = pm.sample(model, burn_in=500, num_samples=1000, num_chains=10, xla=True,
                  adaptation_kwargs=dict(target_accept_prob=0.95))
filepath = 'trace.nc'
trace.to_netcdf(filepath)
and it gives this error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
1088 dump_to_store(
-> 1089 dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
1090 )
~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
1134
-> 1135 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
1136
~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
295 self.set_attributes(attributes)
--> 296 self.set_dimensions(variables, unlimited_dims=unlimited_dims)
297 self.set_variables(
~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/common.py in set_dimensions(self, variables, unlimited_dims)
372 is_unlimited = dim in unlimited_dims
--> 373 self.set_dimension(dim, length, is_unlimited)
374
~/.virtualenvs/pymc4/lib/python3.6/site-packages/xarray/backends/netCDF4_.py in set_dimension(self, name, length, is_unlimited)
420 dim_length = length if not is_unlimited else None
--> 421 self.ds.createDimension(name, size=dim_length)
422
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.createDimension()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dimension.__init__()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()
RuntimeError: NetCDF: Name contains illegal characters
The illegal characters seem to be the '/' in the variable names. I have a lot of variables, but could manually replace the characters if need be. However, being able to save them as they are would be better, especially for reloading the trace.
Do you have any suggestions on saving the trace to allow such characters? If the '/' character makes saving problematic, are there any alternative separators that could be specified when PyMC4 creates the variable names?
Update: This seems to work for saving the posterior, using xarray's Dataset.to_netcdf function with the 'scipy' engine:
trace.posterior.to_netcdf('posterior.h5', engine='scipy')
Reload with:
from xarray import open_dataset
posterior = open_dataset('posterior.h5', engine='scipy')
I think I could do this for all the groups in trace (sketched below); it's just that trace.to_netcdf and az.to_netcdf(trace) don't seem to work for me, even when I replace the '/' in the variable names with '.' manually.
Has anyone got any better ideas for saving/loading the trace object? I'm fairly new to xarray and arviz. I think it would be useful to have a save trace function in PyMC4 in future and I'd be happy to contribute at some point 😁 I'm excited at the prospect of using PyMC4 in my work, so happy to help with its progress!
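For reference, here is an untested sketch of that per-group idea; the group names listed are assumptions about what the trace contains:
import arviz as az
from xarray import open_dataset

# Whichever groups the trace actually has; these two are assumptions.
groups = ['posterior', 'sample_stats']

# Save each group to its own file with the scipy engine.
for group in groups:
    getattr(trace, group).to_netcdf(group + '.h5', engine='scipy')

# Reload the files and reassemble an InferenceData object.
datasets = {group: open_dataset(group + '.h5', engine='scipy') for group in groups}
trace_reloaded = az.InferenceData(**datasets)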
Hi @alexlyttle! Thanks for pinging in. This is an interface design decision between the ArviZ devs and the PyMC4 devs; the namespacing in pm4 is probably why you are seeing that error.
Tagging here @canyon289 @ColCarroll @AlexAndorra and @aloctavodia, who I think are the more active ones on ArviZ amongst us. They probably know better what kind of API changes are needed - perhaps pushing up the engine kwarg to arviz' top-level API, or collecting kwargs?
Hi @alexlyttle, and thanks for the ping @ericmjl !
I don't think this is on our radar, as PyMC4 is not production-ready yet. That being said, the fact that you can do it with xarray's to_netcdf function is probably a hint that this feature could be quite easy to implement.
Pinging @OriolAbril and @ahartikainen as they know more than me about this.
We have to look carefully into this. IIRC, the / are used as group identifiers, and on the ArviZ side we rely heavily on groups. That being said, subgroups are allowed, so this should not break anything if done properly.
I am not following pymc4 closely enough, so the following proposal may make no sense. If the variable names are model_name/var_name, maybe it is possible to drop the starting model_name and store it as an attribute of the InferenceData object? There are several places where we use model names (which currently have to be passed as dict keys to az.compare, az.plot_elpd...) and we are considering adding a name attribute to InferenceData objects.
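A rough, untested sketch of that idea on the posterior group (it assumes every variable name contains at least one / and shares the same leading model name):
# Take the model name from the first variable, e.g. 'example_model'.
posterior = trace.posterior
model_name = next(iter(posterior.data_vars)).split('/')[0]
# Drop the 'model_name/' prefix from every variable name.
posterior = posterior.rename(
    {name: name.split('/', 1)[1] for name in posterior.data_vars}
)
# Keep the model name as metadata instead.
posterior.attrs['model_name'] = model_name
posterior.to_netcdf('posterior.nc')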
Thanks @ericmjl for tagging the relevant people!
Hi @AlexAndorra, thanks for the response. I understand this isn't a priority at the moment, but I thought it worth mentioning. With regards to PyMC4 not being production-ready: this may be a separate issue for discussion elsewhere, but we have been using PyMC3 in our research for a while and our particular model takes hours to run. I recently converted it to PyMC4 and it takes about 10 minutes thanks to GPU XLA! This is a game-changer for us and we look forward to the full release! We are keeping track of the PyMC4 version we use, fully aware that it is pre-release, but I wondered whether it would be fine to use in published work (despite the disclaimer in the README)? My opinion is that if we run the final model in PyMC3 and the results differ negligibly, it's fine. If this needs more discussion I can get in touch another way or in a different issue.
Thanks @OriolAbril, this makes a lot of sense to me. I think in PyMC4 you can have model_name/sub_model_name/var_name, as in this example, but the devs can correct me if I'm wrong! Maybe something which just encodes/decodes the / for saving/reloading would work. The issue with my idea of saving using the 'scipy' engine is that the / are replaced with . but they remain . upon reload.
@alexlyttle, we're thrilled that your model achieves such a big speedup with pymc4 (and GPU). Could I ask you a few questions? How big is your model? By that I mean: how many independent variables does it have to sample from, and how many observed data points?
About the naming convention used by pymc4, you are correct. We follow TensorFlow's policy of name scopes, which can be nested. Each name scope component is separated by forward slashes, and the rightmost part is the plain name of the variable.
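For illustration, a minimal sketch (the exact names produced are indicative, from memory):
import pymc4 as pm

@pm.model
def inner():
    # Ends up in the trace as 'outer/inner/x' due to the nested scopes.
    x = yield pm.Normal('x', loc=0.0, scale=1.0)
    return x

@pm.model
def outer():
    x = yield inner()
    # Ends up as 'outer/y'; the rightmost part is the plain name.
    y = yield pm.Normal('y', loc=x, scale=1.0)
    return y

trace = pm.sample(outer())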
We could do a quick fix by implementing custom save_trace and load_trace functions that change the forward slashes to some combination of characters and then reverse that when loading. It's not ideal, but it should be easy to do.
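Something along these lines (untested sketch; the separator choice and the use of the private _groups attribute are assumptions):
import arviz as az

SEP = '__'  # assumed to be netCDF-safe and absent from variable names

def _rename(dataset, old, new):
    # Swap one separator for the other in every variable name.
    return dataset.rename({name: name.replace(old, new) for name in dataset.data_vars})

def save_trace(trace, path):
    # _groups is private ArviZ API at the time of writing.
    groups = {g: _rename(getattr(trace, g), '/', SEP) for g in trace._groups}
    az.InferenceData(**groups).to_netcdf(path)

def load_trace(path):
    trace = az.from_netcdf(path)
    groups = {g: _rename(getattr(trace, g), SEP, '/') for g in trace._groups}
    return az.InferenceData(**groups)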
A more complicated but cleaner solution would be to take advantage of HDF5 and netCDF groups. The root level of each name scope can be considered a group, because it is a model that contains a group of random variables and potentially nested models (nested groups). I consider this to be cleaner, but it could potentially impact the ArviZ groups design, so I don't know what the best way to proceed is.
@lucianopaz thanks for your interest! It's possible the main contribution to the speed-up is that we are using a regression neural network (fully connected, 6 x 128 hidden neurons) trained using TensorFlow, which maps inputs to outputs (some of which are observables). This is converted to Theano in order to sample in our PyMC3 model (which cannot utilise the GPU efficiently). In PyMC4 we are able to sample the trained neural net entirely on the GPU. It's a hierarchical model with 5 independent hyper-parameters and 5 object-level parameters which feed into the neural net. There are 4 observables derived from the neural net output for ~100 data points. I hope this makes some sense!
I like the sound of the cleaner solution. I may have a go at doing something like that in my code and feed back here, as I'd be happy to help in some way towards this, but I'm no expert on HDF5 or netCDF.
I am not sure saving the variables in a nested group structure would be cleaner in practice. If the implementation and specification of the data structure were a little different, the situation would be different.
When saving, we would have to give the string after the last / as the variable name and the rest as the group, an operation that has to be performed on each variable. This is not difficult to do, but it is quite inconvenient: we currently call save once on an xarray dataset, which stores all variables at once (in the same group).
When loading, we would have to recursively inspect the groups manually in order to get all the variables, load them, and then merge them all together into a single xarray dataset. Currently, the whole posterior dataset is loaded at once, directly as an xarray dataset.
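To make the inconvenience concrete, saving the posterior alone would look something like this untested sketch (it assumes every variable name contains a /):
# Each variable has to be written separately into its own netCDF group.
mode = 'w'  # the first write creates the file, later writes append to it
for name, da in trace.posterior.data_vars.items():
    group, _, var_name = name.rpartition('/')
    da.to_dataset(name=var_name).to_netcdf(
        'posterior.nc', mode=mode, group='posterior/' + group
    )
    mode = 'a'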
Moreover, the nested group structure would not be part of the loaded dataset: users cannot select variables based on the values before the / using xarray. It can be done in ArviZ (starting from the latest release) with regular expression matching in the var_names argument, which is independent of the /; using . or : would work in the same way.
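For instance, something along these lines (the pattern is illustrative):
import arviz as az

# Select every variable under a given name scope with a regex.
az.summary(trace, var_names=['^main_model/'], filter_vars='regex')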
And finally, groups/datasets are independent, while coordinate values are shared within a group/dataset object (that is, the chain, draw, and extra dimension labels are stored once per group/dataset, not once per variable). Thus, using nested groups we would be storing the exact same coordinate labels multiple times.
Thanks for the feedback @alexlyttle. About publishing, you have to be aware that at the moment pymc4 has no initial tuning for the NUTS sampler. This means that the mass matrix (which is the covariance matrix of the momentum proposals done at the beginning of each HMC step) will not be automatically tuned to the model you are using. This can lead to poor exploration of the state space, a suboptimal acceptance rate, and potentially more divergences during sampling. This does not mean that you cannot use the traces that you get from pymc4, but it means that you have to diagnose your results carefully. You will have to look at the effective sample size (which depends on the autocorrelation of the samples in the trace) and the Rhat (which depends on how well mixed each chain is and on whether every chain reaches the same steady state). You will also have to look at the warmup period, in which the samples go from the initial proposal to the steady-state typical set. Finally, you should look at the energy diagnostics to check whether your MCMC explored the full posterior parameter space or whether you are getting biased answers.
Actually, you should always try to look at these quantities, but without a good estimate of the mass matrix, it's almost certain that your trace will have some problematic behavior. At the moment, the only thing that you can do about that is to try to change the step_size by hand to improve your trace, or to draw more samples from your chain and then thin it by slicing (as one usually does with Metropolis-Hastings).
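For example (a sketch; the step_size argument to pm.sample and the values shown are just indicative):
# Draw extra samples with a hand-picked step size, then thin by slicing.
trace = pm.sample(model, burn_in=500, num_samples=10000, num_chains=10,
                  step_size=0.01, xla=True)
thinned = trace.posterior.isel(draw=slice(None, None, 10))  # keep every 10th draw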
@OriolAbril, OK, so we will have to write custom save_trace and load_trace functions in pymc4 that change the / character under the hood to something that doesn't break netCDF.
Thanks for the advice @lucianopaz! Actually, I have been noticing divergences and some other issues with the chains. With regards to changing the step_size, I tried implementing something similar to the "radon_hierarchical" notebook, but I will need to take some time to fully understand this. We'll make sure to check these things (effective samples, Rhat, etc.) before publishing, and will move back to PyMC3 if we struggle with manual tuning. I hope to write a blog post soon with our PyMC4 model to show off its potential for future research. When I do, I'll link it in this thread in case you're interested.