
Internal refactor

Open tomicapretto opened this issue 10 months ago • 9 comments

This started as a refactor with the goal of changing how we name the parameters of the response distribution. So far, we named the parent parameter with f"{response_name}_mean" and all the other parameters with f"{response_name}_{param_name}". I no longer think appending _mean is a sensible approach, as the parent parameter can be something other than the mean. I also realized that prepending the name of the response resulted in very long and confusing names. For those reasons, I decided it's better to just use the name of the parameter itself (i.e. mu for the mean in many families, p for the probability of success, kappa, sigma, and so on).
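To make the change concrete, here is an illustrative sketch of the two naming schemes. These helper functions are not part of Bambi; they just reproduce the string patterns described above:

```python
def old_param_name(response_name: str, param_name: str, is_parent: bool) -> str:
    # Old scheme: the parent parameter was named "<response>_mean",
    # every other parameter "<response>_<param>".
    if is_parent:
        return f"{response_name}_mean"
    return f"{response_name}_{param_name}"


def new_param_name(param_name: str) -> str:
    # New scheme: just the distribution's own parameter name.
    return param_name


print(old_param_name("y", "mu", is_parent=True))      # y_mean
print(old_param_name("y", "sigma", is_parent=False))  # y_sigma
print(new_param_name("sigma"))                        # sigma
```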

But that is not the only change. Many other changes come with this PR. I summarize them below, and they will be added to the changelog. After this is merged, we can have a 0.14.0 release.

Summary of changes

  • Use "__obs__" instead of f"{response_name}_obs" as the dimension name for the observation index. This makes it easier to avoid errors in our code base and results in a clear and unique name for all cases. Previously, we had to worry about the possible different names of the response and their aliases.
  • Use param_name instead of f"{response_name}_{param_name}" (mentioned above).
  • Use the name of the parent parameter instead of f"{response_name}_mean" (mentioned above).
  • kind in Model.predict() now uses "params" and "response" instead of "mean" and "pps".
    • The old versions still work, with a future warning.
  • include_mean has been replaced by include_params in Model.fit(). The old version still works, with a future warning.
  • Wrap all the parameters of the response distribution (the likelihood) with a pm.Deterministic. The benefit is that the model graphs are clearer, and we don't incur any penalty related to the computation and storage of deterministics, as they are not computed or stored when sampling. This is thanks to some recent developments in PyMC.
  • Keep bayeux-ml as the single direct JAX-related dependency. Bayeux itself requires all the other libraries that provide sampling backends, so we don't list them directly. This may change in the future if we need to pin specific versions, but I hope that's not the case.
  • Implemented a function called create_posterior_bayeux that creates the xarray.Dataset that holds the samples from the posterior when doing inference via bayeux. This makes sure we use the right dim/coord names and coord levels.
  • The response component now only holds information about the response. Previously, it also held information about the parent parameter. There's a clearer separation of concerns now. For example, to access things related to the linear predictor of the parent parameter in a normal model, we now do model.components["mu"]; before, we needed model.response_component. That component still exists, but it no longer holds information about mu.
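As a sketch of how the deprecated kind values could keep working with a future warning, here is a hypothetical helper. The alias mapping and function below are illustrative, not Bambi's actual implementation:

```python
import warnings

# Hypothetical mapping of deprecated values to the new ones.
_KIND_ALIASES = {"mean": "params", "pps": "response"}


def normalize_kind(kind: str) -> str:
    # Accept the old values but emit a FutureWarning, as described above.
    if kind in _KIND_ALIASES:
        new_kind = _KIND_ALIASES[kind]
        warnings.warn(
            f"kind={kind!r} is deprecated, use kind={new_kind!r} instead",
            FutureWarning,
        )
        return new_kind
    return kind


print(normalize_kind("pps"))     # response
print(normalize_kind("params"))  # params
```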

Other important notes

  • I've re-run all the examples.
  • I've adapted the tests and made sure they pass.
  • I'm pinning PyMC to main because truncated responses need a fix that is in there. We can change it once a new version (5.16?) is released.

~~EDIT: Bayeux-based inferences don't include the observed_data group. This is problematic if we want to do posterior predictive checks. We need to add them.~~ Done

tomicapretto avatar Apr 14 '24 03:04 tomicapretto

This is temporarily suspended because the var_names parameter in pm.sample() is not working as expected https://github.com/pymc-devs/pymc/issues/7258, and we need it for the modifications here.

tomicapretto avatar Apr 15 '24 15:04 tomicapretto

Now there's a new blocker which is https://github.com/pymc-devs/pymc/issues/7312

And I would also like to have https://github.com/pymc-devs/pymc/pull/7290 merged. Currently, the progress bar has many updates, which slows down the sampler in Jupyter notebooks.

tomicapretto avatar May 13 '24 13:05 tomicapretto

Everything should be fixed now. The pyproject.toml file is pointing to the dev version of PyMC. This should be good to go once PyMC releases a new version.

Now I'm going to re-run examples.

tomicapretto avatar May 25 '24 20:05 tomicapretto


I like the changes, but I find include_params and kind="params" ambiguous. Could we use something like linear_predictor or linear_term instead?

aloctavodia avatar May 31 '24 14:05 aloctavodia

include_params

Thanks for the review! I'm not sure I understand the suggestion (i.e., whether you want argument(s) called linear_predictor or linear_term, or whether those should be argument values). I also agree the current names are not the best, as "params" can be many things in a model.

tomicapretto avatar Jun 02 '24 15:06 tomicapretto

I am saying that the name "params" is very vague.

aloctavodia avatar Jun 02 '24 21:06 aloctavodia

@GStechschulte thanks for the review! Do you have any ideas for the issue @aloctavodia mentioned? I agree the current approach is a bit vague. But I don't have lots of ideas right now :D

tomicapretto avatar Jun 14 '24 14:06 tomicapretto

@tomicapretto @aloctavodia apologies for the delayed response. This notification slipped through my GitHub 😞

Before, include_mean would compute the "posterior of the mean response", which technically may not always be a mean, as you stated above. This computation is $g^{-1}(\eta)$, which relates the linear predictor to the response. Since it is a response parameter, but not always the mean, maybe include_response_params? It's a longer name, but less vague.

To be consistent in model.predict:

  • kind="response_params", since calling model.fit(include_response_params=True) will result in the same InferenceData as the sequence of calls: (1) model.fit(include_response_params=False), then (2) model.predict(kind="response_params", data=None)

  • kind="response_obs" or "response", so users "know" that posterior predictive samples are returned and that the samples are on the response scale.
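As a toy illustration of the $g^{-1}(\eta)$ mapping mentioned above, consider a Bernoulli family with a logit link, where the inverse link turns the linear predictor into the success probability p (purely illustrative, not Bambi code):

```python
import math


def inv_logit(eta: float) -> float:
    # Inverse of the logit link: maps the linear predictor eta to the
    # success probability p, the parent parameter of a Bernoulli family.
    return 1.0 / (1.0 + math.exp(-eta))


print(inv_logit(0.0))            # 0.5
print(round(inv_logit(2.0), 3))  # 0.881
```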

GStechschulte avatar Jun 21 '24 05:06 GStechschulte