ibis
feat: interdependent calculations within single call to .mutate()
Is your feature request related to a problem?
New user here. Very impressed with the package so far :)
I'm not sure if I'm missing something, but I don't think it is possible to write something like this, where c depends on the calculated column b and is defined within the same .mutate() call:
(
df
.mutate(b = df.a + 1,
# n.b. reference to b below, but _ is not designed for this, it refers to output of prev step instead
c = _.b + 2)
)
Instead, multiple calls to mutate are needed, as below:
(
df
.mutate(b = df.a + 1)
.mutate(c = _.b + 2)
)
Obviously this is not too painful in this example, but once you have many columns to add with a few dependencies between them, you need a lot of calls to mutate and things start to get messy.
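In the meantime, one possible workaround is a small helper that turns one set of keyword arguments into a chain of single-column mutate calls. The sketch below is plain Python and does not import ibis: `ToyTable` is a hypothetical stand-in that only mimics the "every expression sees the previous step" behavior described above, and `mutate_in_order` is an illustrative name, not an ibis function.

```python
class ToyTable:
    """Hypothetical stand-in for a table expression: one row held as a dict."""

    def __init__(self, row):
        self.row = dict(row)

    def mutate(self, **exprs):
        # Every expression sees the row as it was BEFORE this call,
        # mirroring how `_` refers to the output of the previous step.
        new_values = {name: fn(self.row) for name, fn in exprs.items()}
        return ToyTable({**self.row, **new_values})


def mutate_in_order(table, **exprs):
    """Apply each keyword as its own .mutate() call, in the order written."""
    for name, expr in exprs.items():  # kwargs keep their order in Python 3.7+
        table = table.mutate(**{name: expr})
    return table


df = ToyTable({"a": 1})
result = mutate_in_order(
    df,
    b=lambda r: r["a"] + 1,
    c=lambda r: r["b"] + 2,  # sees b because it runs in a later step
)
print(result.row)  # {'a': 1, 'b': 2, 'c': 4}
```

The helper only relies on the object having a `.mutate(**kwargs)` method, so the same loop should in principle work on a real table with `_` expressions, though that is an assumption and is not tested here.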
What is the motivation behind your request?
No response
Describe the solution you'd like
Some way of referencing new fields within the same call to .mutate() would be nice.
What version of ibis are you running?
9.1.0
What backend(s) are you using, if any?
Postgres
A problem arises during renaming: how should something like this behave?
df.mutate(b=df.b + 1, c=_.b + 2)
Which b would that use?
Thanks for the reply :)
Agree that with the current meaning of "_", that would be a problem
Two ideas:
- throw an exception if a user tries to both ~~rename~~alter an existing field and use the ~~same variable~~altered field within a single mutate. I think this might be the better option because of i) lower risk of a reader getting confused about what was intended and ii) no need to have another version of _ that disambiguates previous vs current mutate (see next idea)
- have a new version of "_" i.e. a different variable that can be used to refer to variables within the same mutate call. e.g. let's call it F:
df.mutate(b=df.b + 1, c=F.b + 2)
> throw an exception if a user tries to both alter an existing field and use the same altered field within a single mutate
Right now, the behavior of this is unambiguous: _ always refers to the relation to the left of the last . the user wrote, so something like mutate(b=_.b + 1, c=_.b) is valid.
It would likely break a bunch of existing code to start raising an exception when someone renames with the same name and uses that field in other computed columns in the same mutate call.
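To make the difference concrete, here is a toy model in plain Python (not ibis itself). It assumes only what is stated above: every expression in one mutate call is evaluated against the incoming relation. The `mutate` function and the dict-based row are hypothetical and exist purely for illustration.

```python
def mutate(row, **exprs):
    """Toy model: evaluate every expression against the incoming row."""
    updates = {name: fn(row) for name, fn in exprs.items()}
    return {**row, **updates}


row = {"b": 10}

# One call: both expressions read the ORIGINAL b.
same_call = mutate(row, b=lambda r: r["b"] + 1, c=lambda r: r["b"] + 2)

# Two calls: the second expression reads the UPDATED b.
chained = mutate(mutate(row, b=lambda r: r["b"] + 1), c=lambda r: r["b"] + 2)

print(same_call)  # {'b': 11, 'c': 12}
print(chained)    # {'b': 11, 'c': 13}
```

Under this model the two spellings give different values for c, which is why changing what a same-call reference means (or raising on it) would affect existing code.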
Thanks, I see your point.
I guess if the idea behind this PR seemed worth pursuing, options could be:
- add deprecation warning for where currently people use code like mutate(b=_.b + 1, c=_.b), then eventually remove that and throw instead
- keep the current approach and allow mutate(b=_.b + 1, c=_.b), but issue a warning that altering b and using previous b in same step is ambiguous/not recommended
- keep current approach but issue no message and let people work it out for themselves :)
If you consider none of these viable then no problem, I hadn't foreseen this issue
Thanks!
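For reference, the "keep the behavior but warn" option could look roughly like the sketch below. It is a toy model in plain Python, not a proposal for how ibis should implement it: `RecordingRow` and `mutate_with_warning` are hypothetical names, and the detection works by recording which keys each expression reads rather than by inspecting real expression trees.

```python
import warnings


class RecordingRow(dict):
    """Dict that remembers which keys were read (hypothetical helper)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.read = set()

    def __getitem__(self, key):
        self.read.add(key)
        return super().__getitem__(key)


def mutate_with_warning(row, **exprs):
    """Toy mutate: evaluates against the incoming row, warns on ambiguity."""
    updates, reads = {}, {}
    for name, fn in exprs.items():
        probe = RecordingRow(row)
        updates[name] = fn(probe)
        reads[name] = probe.read
    overwritten = set(exprs) & set(row)
    for name, used in reads.items():
        # An expression reading a column that a *different* expression
        # overwrites in the same call is the ambiguous case.
        clash = (used & overwritten) - {name}
        if clash:
            warnings.warn(
                f"{name!r} reads {sorted(clash)}, which this call also "
                "overwrites; the previous values are used",
                UserWarning,
                stacklevel=2,
            )
    return {**row, **updates}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = mutate_with_warning(
        {"b": 10}, b=lambda r: r["b"] + 1, c=lambda r: r["b"]
    )

print(out)          # {'b': 11, 'c': 10}
print(len(caught))  # 1
```

Note that b=_.b + 1 on its own does not trigger the warning in this sketch; only a second column reading the overwritten one does.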
I think option 4 might be to keep the current approach but document this behavior in our tutorials that use _ and also in the API docs (probably in select and friends), so that it's more obvious that the lack of support for reuse-in-the-same-mutate is intentional.
Thanks for engaging! Appreciate the feedback.
Sounds great to me, thanks for taking a look :)
Hey @hottwaj! Seems like @cpcloud's recommended path forward makes sense. Would you like to raise a PR to make the docs clearer on this? Otherwise, we can add an issue to our backlog.