pymc-examples
Survival analysis
File: https://github.com/pymc-devs/pymc-examples/blob/main/examples/survival_analysis/survival_analysis.ipynb Reviewers:
The sections below may still be pending. If so, the issue is still available; it simply doesn't have specific guidance yet. Please refer to this overview of updates
Known changes needed
Changes listed in this section should all be done at some point in order to get this notebook to a "Best Practices" state. However, these are probably not enough! Make sure to thoroughly review the notebook and search for other updates.
General updates
- fix typos
- fix math rendered incorrectly
- update notation to use equal tailed interval (or ETI) instead of HPD (or HDI); both are valid credible intervals, but they are different, so we should not write HPD if using ETI
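Since the ETI/HPD distinction is the point of that bullet, here is a quick numpy sketch of the difference (the `hdi` helper below is an illustrative re-implementation for a 1-D sample, not the ArviZ one):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=1.0, size=10_000)  # a right-skewed "posterior"

# Equal tailed interval (ETI): cut 3% of the probability mass from each tail.
eti = np.percentile(samples, [3, 97])

def hdi(draws, prob=0.94):
    """Narrowest interval containing `prob` of the draws (illustrative only)."""
    sorted_draws = np.sort(draws)
    n_included = int(np.floor(prob * len(sorted_draws)))
    # Width of every window of n_included consecutive sorted draws.
    widths = sorted_draws[n_included:] - sorted_draws[: len(sorted_draws) - n_included]
    start = int(np.argmin(widths))
    return sorted_draws[start], sorted_draws[start + n_included]

lo, hi = hdi(samples)
# For a skewed posterior both HDI bounds sit below the ETI bounds, and the
# HDI is never wider -- which is why the two names must not be used interchangeably.
```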
ArviZ related
- Use arviz (or only import? I don't really understand how that notebook runs)
- Use InferenceData and xarray for exploratory analysis of the results
Changes for discussion
Changes listed in this section are up for discussion; these are ideas on how to improve the notebook that may not have a clear implementation, or fix some known issue only partially.
General updates
ArviZ related
Notes
Exotic dependencies
Computing requirements
In addition, should I update the notebook as mentioned in pymc-devs/pymc3#4027?
@olgadk7 That would be terrific, thanks!
That would be amazing indeed, thanks! Just to clarify, we don't expect you to fix all the bullet points above; choose which ones you want to address and let us know so we can review based on that and not force extra work on you after reviewing.
Regarding ArviZ, in the survival_analysis notebook it doesn't do anything that pymc3 couldn't; does it make sense then to only use pymc3 perhaps? Unless I'm missing some rules that govern how to choose between the two (I didn't find this info after a quick search, just that you seemingly always need to set the ArviZ style, as per the Notebook Style Guide).
Related to that, I’m confused how and why the notebook’s source code https://github.com/pymc-devs/pymc-examples/blob/main/examples/survival_analysis/survival_analysis.ipynb calls to az. (without even importing it), but the published notebook https://docs.pymc.io/notebooks/survival_analysis.html uses pymc3 only.
On another note, can I add the "with model" context argument to get rid of the corresponding FutureWarning of deprecation?
Regarding the other needed changes, once I deal with the low-hanging fruit, I’d be happy to look into it and report back asap but most likely I wouldn’t trust myself with the more consequential stuff just yet (only if you “force” me) :)
Regarding ArviZ, in the survival_analysis notebook it doesn’t do anything that pymc3 couldn’t, does it make sense then to only use pymc3 perhaps?
PyMC3 has ArviZ as a dependency and delegates plotting, stats and diagnostics to it. However, for convenience most ArviZ functions are also available as part of the pymc3 namespace: pm.plot_trace or pm.summary are aliases to the corresponding ArviZ functions. While you are free to use those aliases, we want to encourage using ArviZ directly.
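As a concrete sketch of "using ArviZ directly" (the toy data below is illustrative): the same summary pm.summary would produce can be obtained from az.summary without touching the pymc3 aliases at all:

```python
import numpy as np
import arviz as az

# Toy posterior with 4 chains and 500 draws, packed into InferenceData.
rng = np.random.default_rng(42)
idata = az.from_dict(posterior={"beta": rng.normal(size=(4, 500))})

# Call ArviZ directly; pm.summary(idata) would dispatch to this same function.
summary = az.summary(idata)
```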
Related to that, I’m confused how and why the notebook’s source code https://github.com/pymc-devs/pymc-examples/blob/main/examples/survival_analysis/survival_analysis.ipynb calls to az. (without even importing it),
I have no idea myself, and what is worse, the notebook looks executed sequentially unless I counted wrong.
but the published notebook https://docs.pymc.io/notebooks/survival_analysis.html uses pymc3 only.
Docs are built only with releases, and those changes were added to pymc-examples after the last release; they will be published with the next one.
On another note, can I add the "with model" context argument to get rid of the corresponding FutureWarning of deprecation?
which FutureWarning? We are currently working on the next major version of PyMC3 so there are several laying around :sweat_smile:
Regarding the other needed changes, once I deal with the low-hanging fruit, I’d be happy to look into it and report back asap but most likely I wouldn’t trust myself with the more consequential stuff just yet (only if you “force” me) :)
If you are interested in working on that, go for it! We'll be here to provide any help necessary.
Thank you! And, got it, will use ArviZ directly.
Re FutureWarning: it seems like where the trace is called without the context of the model object ("with model:") you get a FutureWarning saying "Using from_pymc3 without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context." (Adding "with model:" gets rid of the warning.)
Re FutureWarning: it seems like where the trace is called without the context of the model object ("with model:") you get a FutureWarning saying "Using from_pymc3 without the model will be deprecated in a future release. Not using the model will return less accurate and less useful results. Make sure you use the model argument or call from_pymc3 within a model context." (Adding "with model:" gets rid of the warning.)
Yes, that warning is probably triggered when plotting, right? What is happening is that to plot, ArviZ first checks if the input is an InferenceData object; if it isn't, it tries to convert the input to InferenceData automatically. Converting pm.MultiTrace to InferenceData works with only the info in the trace in many cases, but in some others info from the model instance is also needed, so the converter needs both. Calling the plotting functions from within the model context makes the model available to the converter, but this is not ideal.
The recommended solution is to first convert to InferenceData (or directly use return_inferencedata=True in pm.sample) and then use the InferenceData for plotting.
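A minimal sketch of the "convert first" path (the random array below is a stand-in for a MultiTrace; in the notebook you would instead pass return_inferencedata=True to pm.sample):

```python
import numpy as np
import arviz as az

# Stand-in for raw posterior draws: 4 chains x 500 draws.
rng = np.random.default_rng(1)
draws = rng.normal(size=(4, 500))

# Convert once, explicitly. Every later az.plot_* or az.summary call then
# receives InferenceData, so no automatic (model-dependent) conversion runs
# and no FutureWarning is triggered.
idata = az.convert_to_inference_data(draws)
```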
That worked, thank you for explaining! And you're right: triggered when plotting. Should I then add return_inferencedata=True in every pm.sample in the notebook so that there aren't any of those FutureWarnings?
Should I then add return_inferencedata=True in every pm.sample in the notebook so that there aren't any of those FutureWarnings?
Yes; moreover, return_inferencedata=True will be the default in the next version, so if at some place you wanted to avoid InferenceData you'd need to be explicit and manually pass False. The docs should show how to use InferenceData and other best practices, though.
After a cascade of changes resulting from a simple inferencedata change, including updating pymc3 and theano-pymc as well as my whole environment, and conda-removing theano, I'm up against a non-descriptive "NotImplementedError" when calling aesara methods listed in its docs. I've looked for solutions everywhere; please advise.
This is the line causing it: lambda_ = pm.Deterministic("lambda_", T.outer(T.exp(beta * df.metastasized), lambda0)), except I replaced T. with at. as there was a big messy compilation error after removing theano. Am I off-track?
Which versions do you have installed/are using? You shouldn't be using Aesara yet, the latest pymc3 stable release (3.11.2) runs on theano-pymc only: https://github.com/pymc-devs/pymc3/blob/v3.11.2/requirements.txt
Oh, OK. I was under the impression that theano-pymc is aesara, as that's where the theano-pymc link leads on the installation guide, as well as on the "PyMC3 and Theano" page. So to run theano-pymc, I use "import theano"? Versions:
Python implementation: CPython
Python version : 3.7.10
IPython version : 7.22.0
arviz : 0.11.2
cachetools : 4.2.1
dill : 0.3.3
numpy : 1.19.2
pandas : 1.2.3
patsy : 0.5.1
scipy : 1.6.2
theano-pymc : 1.1.2
typing_extensions : 3.7.4.3
pymc3 : 3.11.1
aesara : 2.0.6
Thanks!
So to run theano-pymc, i use "import theano"?
Yes! I think that should be enough for everything to work.
So I'm back to that big messy compilation error (below). Everything I read about it is either old, contradictory, or beyond me. Do you have an idea what it's about? Thank you.
INFO (theano.gof.compilelock): Refreshing lock /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/lock_dir/lock
You can find the C code in this temporary file: /var/folders/n9/bmmyc8hd0x505y9fd0hm1mc40000gn/T/theano_compilation_error_8n8o433a
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-16-bceb15e5e724> in <module>
5 beta = pm.Normal("beta", 0, sd=1000)
6
----> 7 lambda_ = pm.Deterministic("lambda_", T.outer(T.exp(beta * df.metastasized), lambda0))
8 mu = pm.Deterministic("mu", exposure * lambda_)
9
~/anaconda3/lib/python3.7/site-packages/pymc3/model.py in Deterministic(name, var, model)
1431 """
1432 model = modelcontext(model)
-> 1433 var = var.copy(model.name_for(name))
1434 model.deterministics.append(var)
1435 model.add_random_variable(var)
~/anaconda3/lib/python3.7/site-packages/theano/tensor/var.py in copy(self, name)
620 Does not copy the tags.
621 """
--> 622 copied_variable = theano.tensor.basic.tensor_copy(self)
623 copied_variable.name = name
624 return copied_variable
~/anaconda3/lib/python3.7/site-packages/theano/gof/op.py in __call__(self, *inputs, **kwargs)
668 # compute output value once with test inputs to validate graph
669 thunk = node.op.make_thunk(node, storage_map, compute_map,
--> 670 no_recycling=[])
671 thunk.inputs = [storage_map[v] for v in node.inputs]
672 thunk.outputs = [storage_map[v] for v in node.outputs]
~/anaconda3/lib/python3.7/site-packages/theano/gof/op.py in make_thunk(self, node, storage_map, compute_map, no_recycling, impl)
953 try:
954 return self.make_c_thunk(node, storage_map, compute_map,
--> 955 no_recycling)
956 except (NotImplementedError, utils.MethodNotDefined):
957 # We requested the c code, so don't catch the error.
~/anaconda3/lib/python3.7/site-packages/theano/gof/op.py in make_c_thunk(self, node, storage_map, compute_map, no_recycling)
856 _logger.debug('Trying CLinker.make_thunk')
857 outputs = cl.make_thunk(input_storage=node_input_storage,
--> 858 output_storage=node_output_storage)
859 thunk, node_input_filters, node_output_filters = outputs
860
~/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py in make_thunk(self, input_storage, output_storage, storage_map, keep_lock)
1215 cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
1216 input_storage, output_storage, storage_map,
-> 1217 keep_lock=keep_lock)
1218
1219 res = _CThunk(cthunk, init_tasks, tasks, error_storage, module)
~/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py in __compile__(self, input_storage, output_storage, storage_map, keep_lock)
1155 output_storage,
1156 storage_map,
-> 1157 keep_lock=keep_lock)
1158 return (thunk,
1159 module,
~/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py in cthunk_factory(self, error_storage, in_storage, out_storage, storage_map, keep_lock)
1622 node.op.prepare_node(node, storage_map, None, 'c')
1623 module = get_module_cache().module_from_key(
-> 1624 key=key, lnk=self, keep_lock=keep_lock)
1625
1626 vars = self.inputs + self.outputs + self.orphans
~/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py in module_from_key(self, key, lnk, keep_lock)
1187 try:
1188 location = dlimport_workdir(self.dirname)
-> 1189 module = lnk.compile_cmodule(location)
1190 name = module.__file__
1191 assert name.startswith(location)
~/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py in compile_cmodule(self, location)
1525 lib_dirs=self.lib_dirs(),
1526 libs=libs,
-> 1527 preargs=preargs)
1528 except Exception as e:
1529 e.args += (str(self.fgraph),)
~/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py in compile_str(module_name, src_code, location, include_dirs, lib_dirs, libs, preargs, py_module, hide_symbols)
2397 # difficult to read.
2398 raise Exception('Compilation failed (return status=%s): %s' %
-> 2399 (status, compile_stderr.replace('\n', '. ')))
2400 elif config.cmodule.compilation_warning and compile_stderr:
2401 # Print errors just below the command line.
Exception: ("Compilation failed (return status=1): /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:400:27: error: non-constant-expression cannot be narrowed from type 'npy_intp' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:400:27: note: insert an explicit cast to silence this issue. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. static_cast<int>( ). /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:400:34: error: non-constant-expression cannot be narrowed from type 'npy_intp' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:400:34: note: insert an explicit cast to silence this issue. int init_totals[2] = {V3_n0, V3_n1};. ^~~~~. static_cast<int>( ). /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:412:9: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V3_stride0, V3_stride1, . ^~~~~~~~~~. /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:412:9: note: insert an explicit cast to silence this issue. V3_stride0, V3_stride1, . ^~~~~~~~~~. static_cast<int>( ). /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:412:21: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V3_stride0, V3_stride1, . ^~~~~~~~~~. 
/Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:412:21: note: insert an explicit cast to silence this issue. V3_stride0, V3_stride1, . ^~~~~~~~~~. static_cast<int>( ). /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:413:1: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V1_stride0, V1_stride1. ^~~~~~~~~~. /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:413:1: note: insert an explicit cast to silence this issue. V1_stride0, V1_stride1. ^~~~~~~~~~. static_cast<int>( ). /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:413:13: error: non-constant-expression cannot be narrowed from type 'ssize_t' (aka 'long') to 'int' in initializer list [-Wc++11-narrowing]. V1_stride0, V1_stride1. ^~~~~~~~~~. /Users/olgakahn/.theano/compiledir_Darwin-19.6.0-x86_64-i386-64bit-i386-3.7.10-64/tmpss108u77/mod.cpp:413:13: note: insert an explicit cast to silence this issue. V1_stride0, V1_stride1. ^~~~~~~~~~. static_cast<int>( ). 6 errors generated.. ", '[Elemwise{identity}(<TensorType(float64, matrix)>)]')
I always start by deleting my ~/.theano directory. That sometimes fixes odd compilation issues (but not always).
Thank you! That was seemingly a problem related to conda-forge installing an older version.
I have another question, please: what's the best way to realign the trace's shape now that it is an InferenceData object (after being a MultiTrace)? I'm getting "ValueError: operands could not be broadcast together with shapes" when doing trace.posterior["lambda0"] * np.exp(np.atleast_2d(trace.posterior["beta"]).T).
InferenceData shape:
print(trace.posterior["lambda0"].shape)
print(trace.posterior["beta"].shape)
(4, 1000, 76)
(4, 1000)
MultiTrace shape:
print(MultiTrace_trace["lambda0"].shape)
print(MultiTrace_trace["beta"].shape)
(4000, 76)
(4000,)
Appreciate your guidance.
I have another question, please: what's the best way to realign trace's shape when it became an InferenceData object (after being a MultiTrace)? Getting a "ValueError: operands could not be broadcast together with shapes" when doing trace.posterior["lambda0"] * np.exp(np.atleast_2d(trace.posterior["beta"]).T).
One of the coolest features of InferenceData is that the data is stored in xarray Datasets. xarray can be seen as the pandas equivalent for n-d data; it uses named dimensions and coordinates, which allow it to align and broadcast arrays automatically. All the code above with the atleast_2d and transpose is now unnecessary: you can do trace.posterior["lambda0"] * np.exp(trace.posterior["beta"]).
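That alignment can be sketched with synthetic arrays shaped like the ones above (the dimension names follow ArviZ's defaults mentioned in the thread; a smaller draw count keeps the example light):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
# Mimic the posterior layout: lambda0 carries an extra interval dimension.
lambda0 = xr.DataArray(
    rng.random((4, 100, 76)), dims=("chain", "draw", "lambda0_dim_0")
)
beta = xr.DataArray(rng.normal(size=(4, 100)), dims=("chain", "draw"))

# Named dimensions broadcast by name, so no atleast_2d or transpose is needed.
hazard = lambda0 * np.exp(beta)
```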
Quick question: ArviZ's label guide says all plotting functions support the labeller argument, but I don't see it in plot_hdi. How do we attach labels to the HDI plot's legend otherwise? PS: it's been nice to discover the world of xarray, thank you!
plot_hdi doesn't support a legend "by itself". You can use for example fill_kwargs={"label": "var_name"} to get one. It will look like https://docs.pymc.io/notebooks/multilevel_modeling.html. Maybe we could be explicit about that in the label guide, but basically there are a few plotting functions in ArviZ that are "low level": they don't take InferenceData as input and are generally called by other plotting functions, so they behave a bit differently from the others, especially when it comes to labelling, as they are designed to take array inputs, so labels must be passed to them explicitly. The low-level functions are plot_kde, plot_dist and plot_hdi (which has no higher-level counterpart for now; hopefully we'll get to add regression-specific plots this summer that use it).
Also note that the ArviZ docs are for the development version, not for the release. The labeller feature has not been released yet. To use it you'll need to install the ArviZ development version, which I would generally recommend (there aren't any differences in testing between releases and commits to main, and we don't use release candidates), except for examples and situations where you want to be able to pin down a specific release.
Hopefully one last piece I don’t get:
I'm trying to plot on the y-axis the mean of a DataArray, tv_met_hazard.mean(dim=("chain", "draw")), where tv_met_hazard = time_varying_trace.posterior["lambda0"] * np.exp(time_varying_trace.posterior["beta"]) and has coords chain, draw, lambda0_dim_0, and beta_dim_0.
It returns the same values as its MultiTrace version but with a different mean of means, and in 2-D:
tv_met_hazard.mean(("chain", "draw")).shape
(76, 76)
even
tv_met_hazard.mean(("chain", "draw")).values.shape
(76, 76)
While the similarly constructed DataArray met_hazard = trace.posterior["lambda0"] * np.exp(trace.posterior["beta"]) has coords chain, draw, and lambda0_dim_0 and returns
met_hazard.mean(("chain", "draw")).shape
(76,)
The only difference is that the former DataArray's beta parameter is modeled with pm.GaussianRandomWalk and the latter's with pm.Normal. I've also tried including the beta_dim_0 dimension when calculating the mean, but the values are off.
What am I missing? Thank you.
InferenceData uses named dimensions; if no dimension names are given, default names based on the variable name are used. This is where lambda0_dim_0 and beta_dim_0 come from. When you combine them, the broadcasting and alignment is based on the dimension names, not on the shape. You are therefore multiplying arrays with dimensions chain, draw, lambda0_dim_0 and chain, draw, beta_dim_0, so the result is an array with dimensions chain, draw, lambda0_dim_0, and beta_dim_0. As these dimensions are actually the same, you have to use named dimensions when writing the model (recommended) or when converting to InferenceData. I wrote a blog post about this: https://oriolabril.github.io/oriol_unraveled/python/arviz/pymc3/xarray/2020/09/22/pymc3-arviz.html
In MultiTrace, you have shapes 4, 1000, 76 and 4, 1000, 76 (I guessed the number of chains and draws based on defaults), so the result is 4, 1000, 76 (you may even have 4000 with chain and draw flattened). In this particular case, the sum works automatically, but if you had 4, 1000, 76 and 4, 1000, 35 and wanted to get a 4, 1000, 76, 35 output, you'd need to do a[:, :, :, None] + b[:, :, None, :], whereas in xarray with correct dimension names it is still a + b alone and everything works.
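The manual-axes version described above looks like this in plain numpy (a smaller draw count keeps the broadcast array light):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 100, 76))
b = rng.random((4, 100, 35))

# Without named dimensions, the pairing of the last axes must be spelled out
# by inserting new axes manually before broadcasting:
out = a[:, :, :, None] + b[:, :, None, :]
# out.shape == (4, 100, 76, 35)
```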
Thank you! Submitted the commit https://github.com/pymc-devs/pymc-examples/pull/96. Happy to hear if any other changes are needed.
Can I take this one to update it to v4?
Reopening to place in the v4 branch as it still needs style updates.