pymc icon indicating copy to clipboard operation
pymc copied to clipboard

Add support for automatic naming of random variables

Open zaxtax opened this issue 2 years ago • 34 comments

This PR introduces a way to use distributions without needing to provide a name.

This means:

import pymc as pm
with pm.Model():
    x = pm.Normal(0., 1.)

is equivalent to

import pymc as pm
with pm.Model():
    x = pm.Normal("x", 0., 1.)

I use some code originally written by @Tobias-Kohn for pyprob. The implementation is done by inspecting stack frames to figure out what the assignment will be. It picks earliest point an assignment can take place.

This PR is intended to be fully backwards-compatible.

Checklist

  • [x] Explain important implementation details 👆
  • [x] Make sure that the pre-commit linting/style checks pass.
  • [x] Are the changes covered by tests and docstrings?
  • [x] Fill out the short summary sections 👇

zaxtax avatar Dec 03 '22 05:12 zaxtax

Codecov Report

Merging #6364 (97b1f58) into main (5688555) will decrease coverage by 0.02%. Report is 434 commits behind head on main. The diff coverage is 84.31%.

:exclamation: Current head 97b1f58 differs from pull request most recent head e835fa0. Consider uploading reports for the commit e835fa0 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6364      +/-   ##
==========================================
- Coverage   94.72%   94.71%   -0.02%     
==========================================
  Files         132      132              
  Lines       26740    26789      +49     
==========================================
+ Hits        25330    25372      +42     
- Misses       1410     1417       +7     
Files Coverage Δ
pymc/tests/distributions/test_distribution.py 98.31% <100.00%> (+0.09%) :arrow_up:
pymc/distributions/distribution.py 93.18% <78.94%> (-2.00%) :arrow_down:

codecov[bot] avatar Dec 03 '22 05:12 codecov[bot]

@pymc-devs/dev-core Weight in :)

ricardoV94 avatar Dec 03 '22 07:12 ricardoV94

Amazing!

twiecki avatar Dec 03 '22 09:12 twiecki

I remember that I didn't like this pattern before, because it messes up the typing of the arguments =/

I know that the typing of distribution arguments isn't great in the first place, but we're also talking about refactoring all these to functions. For functions we could have clean signatures and that would be a great improvement of UX too.

How would the signature of pm.Normal look like if it was a function?

Can this be done with overloads?

michaelosthege avatar Dec 03 '22 10:12 michaelosthege

I remember that I didn't like this pattern before, because it messes up the typing of the arguments =/

I know that the typing of distribution arguments isn't great in the first place, but we're also talking about refactoring all these to functions. For functions we could have clean signatures and that would be a great improvement of UX too.

How would the signature of pm.Normal look like if it was a function?

Can this be done with overloads?

As you mentioned this doesn't change the typing too much as is. If we were to change the distributions to functions, I think this could be handled with an overload. One with a explicit name argument, and one without.

Most of the hackiness of this comes from trying to maintain full backwards-compatibility. If we were ok with an API break, this just gets typed with name becoming an optional argument.

zaxtax avatar Dec 03 '22 14:12 zaxtax

Very cool. We have been wanting to infer names since the days of PyMC2.

fonnesbeck avatar Dec 03 '22 15:12 fonnesbeck

This is a fun and interesting idea, and I really enjoy the inventiveness of the implementation!

However, I'm not sure I really would want this to be in pymc. The main assumption of this is that people want to use the same name for a variable binding as for the (global) name of the variable in the trace. I just went through a couple of my models, and at least for me that just isn't the case. Less than half of the variables had the same name as the binding, especially for larger models, where variable names tend to get longer because they have to stay globally unique, and part of the model creation is in separate functions with local namespaces. I had a quick look at models on discourse, and there I think the percentage looks like more than half, but still very far from 100%. There are lots of models that use different names for the local bindings, and I think often for good reason.

Also from a more abstract perspective this seems a bit unintuitive: the global name of a variable and the name of the variable binding are quite different things, and if we did this I think pymc would be the only library I know that uses binding names internally. It also goes very much against general rules in python, where the code where a function is called doesn't matter inside the call. It also adds a lot of strangeness in the argument order of the RV constructors, where we need quite a bit of logic to figure out if there is a name or not.

This seems to me to trade the small inconvenience of repeating a name from time to time for unique strangeness that goes against how python usually works at its core. That doesn't sound like a good tradeoff to me...

If we want to improve variable naming, I think it would be more interesting to think about namespaces. I've been waiting for xarray to eventually support nesting, if that were to happen I think that would help greatly

aseyboldt avatar Dec 04 '22 01:12 aseyboldt

How does this fare with Distribution constructor helpers such as ZeroInflatedPoisson, OrderedLogistic, LKJCholeskyCov, which are not distributions themselves?

Users should expect the same behavior but I don't want to make these cumbersome to write for developers.

ricardoV94 avatar Dec 04 '22 05:12 ricardoV94

Less than half of the variables had the same name as the binding, especially for larger models, where variable names tend to get longer because they have to stay globally unique, and part of the model creation is in separate functions with local namespaces.

This won't prevent users from using names, so I don't see how this is an argument against. Of course users didn't bother to give unique assignment names in the past, it didn't matter. Obviously that changes if you are using assignment to provide the name. If you repeat it you'll get the usual repeated name error from the Model.

What do you mean with local namespaces?

ricardoV94 avatar Dec 04 '22 05:12 ricardoV94

This won't prevent users from using names, so I don't see how this is an argument against. Of course users didn't bother to give unique assignment names in the past, it didn't matter. Obviously that changes if you are using assignment to provide the name. If you repeat it you'll get the usual repeated name error from the Model.

True, it still is possible to specify the name. But this seems like a pretty big change to me, after all with this we are effectively changing how a call in python works. This might be a good idea if that is almost always what we want, but if it isn't, do we really want to make such a deep and unexpected change? Just imaging the confusion of someone familiar with python but not pymc who is reading a model. Somehow there are names in the trace, but those names were never actually passed into any function anywhere! Or someone might rename a local variable (which never changes behavior in any python code I've seen so far), and then some distant code that analyses the trace breaks.

What do you mean with local namespaces?

I mean things like

def make_spacial_compontent():
    mu = pm.Normal("spacial_mu")
    sigma = pm.HalfNormal("spacial_sigma")
    ...

So the global variable names are more complicated, but because we only ever see part of that in the function we use other variable names there.

aseyboldt avatar Dec 04 '22 05:12 aseyboldt

What do you mean with local namespaces?

I mean things like

def make_spacial_compontent():
 mu = pm.Normal("spacial_mu") 
 sigma = pm.HalfNormal("spacial_sigma") 
 ... 

So the global variable names are more complicated, but because we only ever see part of that in the function we use other variable names there.

If you pass a name to the model you get it as the prefix of every variable, so you could use a nested model for that already with this PR

Edit: Still I don't see the issue as long as you can pass names explicitly.

ricardoV94 avatar Dec 04 '22 05:12 ricardoV94

@aseyboldt I see your points, but I don't think you are the target audience for this feature. For me, this is really focused at beginners, and to them it's often not clear why they have to repeat things. Further optimizing the simplicity and readability of the HelloWorld model is pretty important I think. Intermediate and advanced users will probably keep using names.

twiecki avatar Dec 04 '22 12:12 twiecki

This is a fun and interesting idea, and I really enjoy the inventiveness of the implementation!

However, I'm not sure I really would want this to be in pymc. The main assumption of this is that people want to use the same name for a variable binding as for the (global) name of the variable in the trace. I just went through a couple of my models, and at least for me that just isn't the case. Less than half of the variables had the same name as the binding, especially for larger models, where variable names tend to get longer because they have to stay globally unique, and part of the model creation is in separate functions with local namespaces. I had a quick look at models on discourse, and there I think the percentage looks like more than half, but still very far from 100%. There are lots of models that use different names for the local bindings, and I think often for good reason.

Also from a more abstract perspective this seems a bit unintuitive: the global name of a variable and the name of the variable binding are quite different things, and if we did this I think pymc would be the only library I know that uses binding names internally. It also goes very much against general rules in python, where the code where a function is called doesn't matter inside the call. It also adds a lot of strangeness in the argument order of the RV constructors, where we need quite a bit of logic to figure out if there is a name or not.

This seems to me to trade the small inconvenience of repeating a name from time to time for unique strangeness that goes against how python usually works at its core. That doesn't sound like a good tradeoff to me...

If we want to improve variable naming, I think it would be more interesting to think about namespaces. I've been waiting for xarray to eventually support nesting, if that were to happen I think that would help greatly

Thanks @aseyboldt for your thoughtful feedback. I think these are valid points and I'm not even sure beginners mind having to label their random variables.

As I mentioned to Michael, much of the hackiness of this originates from trying to maintain full backwards-compatibility. Right now we have lots of user code that treats the name as the first argument. The metaprogramming bits though while very odd are no more odd than what happens with Model where we enter a context and now magically an object is modified behind the scenes. A small method that uses the surrounding scope seems like a more isolated form of magic.

There are also larger design questions that maybe could get us to the same destination but using a more elegant route. I'm not sure we need names for the random variables. They seem to me to be mostly needed for visualisation and diagnostic plots.

zaxtax avatar Dec 04 '22 16:12 zaxtax

I'm not sure we need names for the random variables.

We need names to link variables to sampling results. Otherwise you don't know what is what in the results of prior predictive and sample. Posterior predictive, likewise, relies on names to know what's provided and what needs to be resampled.

Storing and loading results with unnamed TensorVariables as the keys would be a mess.

ricardoV94 avatar Dec 04 '22 16:12 ricardoV94

Storing and loading results with unnamed TensorVariables as the keys would be a mess.

I think there is a small mess that comes from needing to hold onto the TensorVariables to use as keys. The big mess comes when you need to give them names to be plotted.

This comes up in other PPLs like BeanMachine where random variables are decorated functions. The name became the function's name concatenated with its arguments. Which leads to plots that usually don't look the cleanest.

zaxtax avatar Dec 04 '22 17:12 zaxtax

Will this work for unicode variable names? People have a habit of using unicode mu, sigma, beta, etc. If yes, it's worth adding some tests for that.

PS. I think this is great. Think it will allow for much easier on-ramping of beginners.

drbenvincent avatar Dec 04 '22 18:12 drbenvincent

Will this work for unicode variable names? People have a habit of using unicode mu, sigma, beta, etc. If yes, it's worth adding some tests for that.

PS. I think this is great. Think it will allow for much easier on-ramping of beginners.

Seems to work with unicode characters

zaxtax avatar Dec 04 '22 18:12 zaxtax

As Python has full unicode support I would expect it to just work, I don't think we need a test for that.

twiecki avatar Dec 05 '22 09:12 twiecki

Manual naming would have to be retained in order to support things like re-usable model components. For example,

    def component_submodel(name, ...):

        # Aging curve
        c_age = pm.HalfNormal(f"c_age_{name}", 10)
        offset_age = pm.HalfNormal(f"offset_age_{name}", 1)

        ...

where the model instantiates several component_submodels that each have their own instances of all of the random variables created inside the function. There does not appear to be an automatic way of naming these.

fonnesbeck avatar Dec 05 '22 18:12 fonnesbeck

Well, it seems like I'm the only one to who this looks like too much magic, if that's the case then I guess it's fine. :-) I'll just argue a little bit more, to at least make sure we don't miss anything...

I guess the reason why this feels like too much magic to me is that this API is not really part of the python language: If you only use API that is not CPython specific, you can not (as far as I know) implement this API at all. In order to implement it, we have to use internal data of CPython that isn't really meant to be used when programming in the language, and that other Python implementations (or future versions of CPython) might not even have. To quote from the CPython docs about some of those functions:

https://docs.python.org/3/library/sys.html#sys._getframe

CPython implementation detail: This function should be used for internal and specialized purposes only. It is not guaranteed to exist in all implementations of Python.

https://docs.python.org/3/reference/datamodel.html#the-standard-type-hierarchy (for the frame attributes)

Internal types A few types used internally by the interpreter are exposed to the user. Their definitions may change with future versions of the interpreter, but they are mentioned here for completeness.

https://docs.python.org/3/library/dis.html?highlight=dis#module-dis

CPython implementation detail: Bytecode is an implementation detail of the CPython interpreter. No guarantees are made that bytecode will not be added, removed, or changed between versions of Python. Use of this module should not be considered to work across Python VMs or Python releases.

At least bytecode also has changed quite a bit over the years as far as I am aware.

This is in contrast to something like the model context, that's just a glorified global variable, and only uses quite simple features of the python language.

I think it is fair to say that an API that can not be implemented in the language is a whole different level of magic. :-)

Given that I don't think it is actually that helpful or at least important, (and @twiecki I think breaking rules of the language is actually worse for beginners than for people who have worked with that for a long time), it just doesn't feel right to me. Answering the question "Why do I have to reapeat the name" seems quite easy to me: "You are not really repeating anything, the string argument is the name in the trace, the name of the assignment is the local binding of that variable. Those two things don't really have anything to do with each other, only that maybe in some cases you might want to choose the same name.")

This also raises some compatibility questions that I think we really should investigate at least a bit before we commit to anything like this: It is quite possible that this can't actually be implemented at all in other python implementations, including future CPython versions. I had a quick look and maybe pypy would work, but I'm not sure. We might also want to look at pyston, because I think there is some talk of using similar jit ideas in future CPython versions. I don't think it is particularly likely, but it really would be unfortunate if we go for an API like this, and a few CPython versions later this suddenly can't be done anymore, or it can only be done in a different subset of cases.

aseyboldt avatar Dec 05 '22 20:12 aseyboldt

Well, damn, you raise some good points @aseyboldt, need to let that sink in a bit more. One nit: "I think it is fair to say that an API that can not be implemented in the language is a whole different level of magic. :-)" I wouldn't say it's the API per se, because it's still just a function call, but rather the code implementing the API that's not Python.

twiecki avatar Dec 05 '22 20:12 twiecki

Adrian isn't the only one with concerns.

If we do this, would we update the documentation to use this feature? All of it? Where would we draw the line?

It's clearly violating the "explicit is better than implicit" rule, and because all of it is happening behind the scenes, it may create more confusion than benefits. After all, the usual API of passing names may be verbose, but it's definitely easy to understand. In particular for examples that opt for informative RV names and short variable names (s = pm.Normal("slope")).

In the end I think its about trading off slope in the very early vs. medium stage learning curve. (Seemingly easier Quickstart examples at the cost of confusion about moderately complex models.)

michaelosthege avatar Dec 05 '22 22:12 michaelosthege

It's also true that after the first slightly negative reaction, people get over it pretty quickly.

On Mon, Dec 5, 2022, 23:14 Michael Osthege @.***> wrote:

Adrian isn't the only one with concerns.

If we do this, would we update the documentation to use this feature? All of it? Where would we draw the line?

It's clearly violating the "explicit is better than implicit" rule, and because all of it is happening behind the scenes, it may create more confusion than benefits. After all, the usual API of passing names may be verbose, but it's definitely easy to understand. In particular for examples that opt for informative RV names and short variable names (x = pm.Data("time", [1,2,3])).

In the end I think its about trading off slope in the very early vs. medium stage learning curve. (Seemingly easier Quickstart examples at the cost of confusion about moderately complex models.)

— Reply to this email directly, view it on GitHub https://github.com/pymc-devs/pymc/pull/6364#issuecomment-1338249586, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFETGGDLGN3WOTMAZSCBUDWLZSMJANCNFSM6AAAAAASSRXDM4 . You are receiving this because you were mentioned.Message ID: @.***>

twiecki avatar Dec 05 '22 22:12 twiecki

@aseyboldt makes some very good points. While I think having the option would be great, it sounds like it's not really an option. Maybe we can better sell why the current apparently verbose approach is beneficial.

drbenvincent avatar Dec 06 '22 07:12 drbenvincent

I don't think there is any benefit to the current approach, it's just the best we could do within Python's limitations (other than going the JAGS way, and defining models as large strings :D). You have to be verbose. Making that optional would be great.

I agree that this is very inelegant, because we're trying to break out of Python's language. Still I think it's fine to exploit features only available in the CPython implementation. It's not like we are testing or explicitly supporting any other Python implementations.

The strongest argument against this approach is that the byte-code can be changed at any time rendering this approach impossible / cumbersome in the future. It would suck to have to backtrack.

ricardoV94 avatar Dec 06 '22 07:12 ricardoV94

I don't have the usual concerns with "too much magic" or being implicit rather than explicit in this case, because ultimately we are talking about naming variables here. If it saves users time and/or makes the model code more concise, then there is not really any downside to me, as long as the approach is robust and allows for corner cases to be accommodated. Moreover, as long as we retain the option to manually override with the name argument, then it should be backward-compatible.

fonnesbeck avatar Dec 06 '22 15:12 fonnesbeck

I don't think there is any benefit to the current approach, it's just the best we could do within Python's limitations (other than going the JAGS way, and defining models as large strings :D). You have to be verbose. Making that optional would be great.

I agree that this is very inelegant, because we're trying to break out of Python's language. Still I think it's fine to exploit features only available in the CPython implementation. It's not like we are testing or explicitly supporting any other Python implementations.

The strongest argument against this approach is that the byte-code can be changed at any time rendering this approach impossible / cumbersome in the future. It would suck to have to backtrack.

In terms of the bytecode, while it can and does change a little every version release, this code seems to work across many version releases even going as far back as python 2.x

zaxtax avatar Dec 09 '22 13:12 zaxtax

In terms of the bytecode, while it can and does change a little every version release, this code seems to work across many version releases even going as far back as python 2.x

That makes me feel better.

twiecki avatar Dec 09 '22 13:12 twiecki

Could you also add a test where the model has a name? To me, it looks like the variable autoname will still have the model name prepended to it, but I think that we should explicitly test that it does.

lucianopaz avatar Dec 23 '22 09:12 lucianopaz

Closing this as the reception was pretty mixed. Thanks anyway @zaxtax!

ricardoV94 avatar Jun 16 '23 07:06 ricardoV94