pymc-examples
Bayesian copula estimation example notebook
This pull request adds a new example notebook for Bayesian copula estimation. Work done by me and @ericmjl.
- [X] Notebook follows style guide https://docs.pymc.io/en/latest/contributing/jupyter_style.html
- [ ] PR description contains a link to the relevant issue: a tracker one for existing notebooks or a proposal one for new notebooks
- [ ] remove `pymc.*` and `pymc3.*` tags from first cell
Currently works under v3. Not working under v4, possibly because `LKJCholeskyCov` has not been refactored for v4 yet (according to @ricardoV94).
OriolAbril commented on 2021-12-18T11:42:16Z ----------------------------------------------------------------
I would use the filename as the target for referencing the notebook too, so it's not necessary to open the notebook and look at the source to know what target to use.
I also looked at the raw text and saw you are using an HTML figure to format the image and caption. You should use https://myst-parser.readthedocs.io/en/latest/syntax/optional.html#markdown-figures instead. This minimizes the HTML needed, which lets Sphinx recognize the image, copy it to the _static folder so it is available on the website (or pull it in when generating PDF docs, for example), and also allows using MyST (and thus markdown) in the caption, for example to reference terms in the glossary or other notebooks.
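For reference, a minimal MyST figure directive looks something like this (the path, width, and caption here are placeholders):

````markdown
```{figure} pymc_arviz_logos.png
:alt: PyMC and ArviZ logos
:width: 400px

A caption written in MyST/markdown, which can include references.
```
````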
drbenvincent commented on 2021-12-19T12:00:32Z ----------------------------------------------------------------
Thanks. Have done these things now.
I also simplified the logos. Rather than 2 separate files in a markdown table, I now have a single image with both logos and no markdown table, which should avoid any strange table rendering issues.
OriolAbril commented on 2021-12-19T16:45:55Z ----------------------------------------------------------------
The `:class: myclass` in those docs is just an example to show that extra attributes can be used; we don't need to include it here.
OriolAbril commented on 2021-12-18T11:42:17Z ----------------------------------------------------------------
I need to add a huge warning to the jupyter style guide about this, but never use html links to refer to other notebooks or to other documentation pages. They are much more prone to breaking and, by design, can't respect the version of the documentation being read; they will always point to latest. The binning notebook uses `awkward_binning` as its target (from looking at https://github.com/pymc-devs/pymc-examples/blob/main/examples/case_studies/binning.ipynb), so you can use
{ref}`here <awkward_binning>`
to generate a link with the text "here" pointing to the binning notebook.
drbenvincent commented on 2021-12-19T12:03:43Z ----------------------------------------------------------------
This should hopefully be fixed now.
OriolAbril commented on 2021-12-19T16:48:34Z ----------------------------------------------------------------
The ref won't work like this. reviewnb messed up the formatting, but you need 3 things: the reference type (`ref`), the text to be shown (`here`), and the target where the link should point (`awkward_binning`). See https://docs.readthedocs.io/en/stable/guides/cross-referencing-with-sphinx.html#the-ref-role for more details.
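Put together, the three parts look like this:

```markdown
{ref}`link text <target>`        <!-- general form -->
{ref}`here <awkward_binning>`    <!-- this notebook's case -->
```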
OriolAbril commented on 2021-12-18T11:42:18Z ----------------------------------------------------------------
Line #6. `np.random.seed(seed=42)`
This should use `rng = np.random.default_rng(42)` and then pass the `rng` where needed to `scipy.stats` functions. There are two options for this: the base class takes a `seed` argument, and the `rvs` method takes a `random_state` argument, both of which are compatible with passing the `rng` generator. References: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous and https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.rvs.html#scipy.stats.rv_continuous.rvs.
From the numpy docs (https://numpy.org/doc/stable/reference/random/legacy.html#legacy):
> This class should only be used if it is essential to have randoms that are identical to what would have been produced by previous versions of NumPy.
I don't think that is the case here.
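A minimal sketch of the suggested pattern (the distribution, sizes, and names here are arbitrary):

```python
import numpy as np
from scipy import stats

# Use the modern Generator API instead of seeding the legacy global state
rng = np.random.default_rng(42)

# Option 1: pass the generator to rvs via its random_state argument
x = stats.norm.rvs(loc=0.0, scale=1.0, size=1000, random_state=rng)

# Option 2: attach the generator to a frozen distribution up front
dist = stats.norm(loc=0.0, scale=1.0)
dist.random_state = rng
y = dist.rvs(size=1000)
```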
drbenvincent commented on 2021-12-19T12:11:24Z ----------------------------------------------------------------
Thanks, I didn't know SciPy distributions took an `rng`, but now I do :) Fixed.
OriolAbril commented on 2021-12-18T11:42:18Z ----------------------------------------------------------------
Is something in the computation of the `data` variable non-deterministic? Would we want to perform the computation preserving the chain and draw dimensions and average only at the end?
drbenvincent commented on 2021-12-19T12:17:38Z ----------------------------------------------------------------
I'm not sure I follow this question. Is this about lines 14, 15, 16? That computation could be done beforehand, outside of the model, as `a_mu`, `a_sigma`, `b_sigma` are basically observed variables. But I'm not sure that's what you are getting at.
ericmjl commented on 2021-12-19T15:18:32Z ----------------------------------------------------------------
Some notes from talking with Oriol:
- The entire block of deterministics can be moved out of the model context (see the sketch after this list).
- We'll probably want to explain why we're using point estimates.
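A hedged sketch of the first point; the marginal families, variable names, and data below are placeholders, not the notebook's actual code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Placeholder observed data and point estimates of the marginal parameters
a_obs = rng.normal(1.0, 2.0, size=500)
b_obs = rng.normal(0.0, 0.5, size=500)
a_mu, a_sigma = a_obs.mean(), a_obs.std()
b_sigma = b_obs.std()

# With the point estimates treated as known, probit(exp(logcdf(x | params)))
# is a deterministic function of the observed data, so it can be computed
# once here rather than as pm.Deterministic nodes inside the model context
u_a = np.exp(stats.norm(a_mu, a_sigma).logcdf(a_obs))
u_b = np.exp(stats.norm(0.0, b_sigma).logcdf(b_obs))
data = stats.norm.ppf(np.column_stack([u_a, u_b]))  # feed this to the model
```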
OriolAbril commented on 2021-12-18T11:42:19Z ----------------------------------------------------------------
How does this differ from the `cov` variable in the posterior?
drbenvincent commented on 2021-12-19T12:42:52Z ----------------------------------------------------------------
As far as I can tell, we could use `copula_idata.posterior.cov` rather than calculating ppc samples.
Doing this results in `copula_idata.posterior.cov.stack(sample=("chain", "draw")).data.shape` being `(2, 2, 4000)`. So we then need to iterate over this last sample dimension rather than doing `dists = [multivariate_normal([0, 0], cov) for cov in ppc["cov"]]`.
But it's not immediately obvious to me how to iterate over the sample dimension of an xarray object. Any suggestions?
ericmjl commented on 2021-12-19T15:22:57Z ----------------------------------------------------------------
Possibly a transpose. (Just leaving a note from calling Oriol.)
ericmjl commented on 2021-12-20T00:37:51Z ----------------------------------------------------------------
@drbenvincent I think a transpose is what we can do. For an array of shape `(2, 2, 4000)`, a transpose using `np.transpose(cov, axes=(2, 0, 1))` should do the trick. That syntax moves the dimension at position 2 to position 0.
OriolAbril commented on 2021-12-20T00:48:39Z ----------------------------------------------------------------
I did some xarray+ArviZ black magic and updated all the calculations to use the full posterior, which does seem to give the same result. Feel free to use all or parts of that. If doing the transposition here, however, do it with xarray and take the values after that. You can use `copula_idata.posterior.cov.stack(sample=("chain", "draw")).transpose("sample", ...)` (the `...` being an actual ellipsis) to ensure the sample dimension is first and the others keep the same order.
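As a sketch, the xarray route described above looks like this (assuming `copula_idata` is the InferenceData from sampling the copula model):

```python
from scipy.stats import multivariate_normal

# Stack chain and draw into a single sample dimension, move it to the
# front, then drop down to a plain (4000, 2, 2) NumPy array
cov_samples = (
    copula_idata.posterior["cov"]
    .stack(sample=("chain", "draw"))
    .transpose("sample", ...)
    .values
)

# Each element along the first axis is now a 2x2 covariance matrix
dists = [multivariate_normal([0, 0], cov) for cov in cov_samples]
```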
Thanks @OriolAbril. I've dealt with all of those points except for two questions.
FYI, this also needs some review (when @ericmjl is able) before we merge. Could you tag him as a reviewer? I don't seem to be able to request reviewers.
I don't understand why the graphviz images aren't displaying.
Don't trust the reviewnb preview; always check the readthedocs preview to see if things look good. reviewnb messes up quite often, and even if it did render correctly, what readers will actually see is readthedocs, so that is what we need to double-check.
Thanks @OriolAbril. That last commit cleans everything up and uses your recommended way of getting the posterior data of `cov` out in a convenient format for plotting and comparison to the data. I've also added you to the acknowledgements 🙂
OriolAbril commented on 2021-12-27T21:56:31Z ----------------------------------------------------------------
The `tags` kwarg is not working: ablog errors out when attempting to parse the initial comma. I think removing the comma will fix the issue.
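For illustration (the tag names here are hypothetical), the fix amounts to dropping the leading comma in the post metadata:

```markdown
<!-- before: ablog errors out on the leading comma -->
:tags: , copula, estimation

<!-- after -->
:tags: copula, estimation
```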
drbenvincent commented on 2021-12-28T10:43:31Z ----------------------------------------------------------------
Extra comma now removed.
View / edit / reply to this conversation on ReviewNB
OriolAbril commented on 2021-12-27T21:56:31Z ----------------------------------------------------------------
This is using point estimates only, for computations that do not commute with the expectation. I did try using the whole posterior here and then averaging to get data in the right shape, and it did seem to yield similar results, but as far as I know this is basically a coincidence.
In my opinion there are two options here:
- use the whole posterior then average
- use point estimates with a note on why the whole posterior is not used
Note: I have several ideas that will hopefully crystallize into a small package extending xarray for linear algebra and statistical computation. I have a note to come back here once that is available, to simplify the code while using the whole posterior.
ericmjl commented on 2021-12-28T00:58:25Z ----------------------------------------------------------------
I think the spelling of `transform` is incorrect. Should be `transform_data`.
drbenvincent commented on 2021-12-28T10:43:10Z ----------------------------------------------------------------
Typo fixed.
I will add some text (shortly) to mention why we are using point estimates.
drbenvincent commented on 2021-12-28T12:05:15Z ----------------------------------------------------------------
In fact, we've already got a note about point estimates under the heading "PyMC models for copula and marginal estimation"
ericmjl commented on 2021-12-28T19:38:33Z ----------------------------------------------------------------
Copying Oriol's point from the discussion to keep the discussion a bit easier to follow:
The note is about simultaneous estimation versus fitting the marginal and covariance models independently, but it doesn't address marginalizing before or after the transformation between observed and multivariate spaces. Marginalizing after transforming is one of the things I tried, and it did give the same result, but the two approaches giving the same result isn't straightforward (to me at least), as the operations don't commute with the expectation.
I think I see Oriol's point: if the expectation of a distribution is E(dist), then exp(E(dist)) is not necessarily equal to E(exp(dist)).
However, we're doing exp(logcdf(E(dist))) here, not exp(E(dist)). Oriol, I'll readily admit to being ignorant here, but does that change anything?
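As a quick numerical check of the non-commutativity under discussion (using a standard normal, chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=200_000)

# exp is convex, so Jensen's inequality gives E[exp(X)] >= exp(E[X])
print(np.exp(x.mean()))  # exp(E[X]) -> approx 1.0
print(np.exp(x).mean())  # E[exp(X)] -> approx 1.65, i.e. exp(1/2)
```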
ericmjl commented on 2022-01-17T16:21:16Z ----------------------------------------------------------------
Copying Oriol's point from the GitHub PR tracker to keep the discussion easier to follow:
The thing is that even if the evaluations of the logcdf are the observations, and thus not random variables, its parameters are. So I think it's something like E(probit(exp(logcdf(x|dist)))) versus probit(exp(logcdf(x|E(dist)))), and it doesn't seem like those must be the same.
Upon thinking about your point, Oriol, I think it is easiest for now to leave a note on why we're not using the whole posterior. I think I get your point, and it makes sense, though I'm also aware that we should close things out before they drag on too long, and it's easier to add a comment than to change the logic of the code.
Ben, please be on the lookout for a few more textual changes to push!
OriolAbril commented on 2021-12-27T21:56:32Z ----------------------------------------------------------------
This title is a level 1 title and should be a level 2 title to preserve the right hierarchy. It should probably be updated too, now that `sample_posterior_predictive` is not being used; something about comparing inference with the truth is probably more fitting.
drbenvincent commented on 2021-12-28T10:42:38Z ----------------------------------------------------------------
Fixed