feat: interdependent calculations within single call to .mutate()

Open JonAnCla opened this issue 1 year ago • 7 comments

Is your feature request related to a problem?

New user here. Very impressed with the package so far :)

I am not completely sure if I'm missing something, but I think that it is not possible to write something like this where c, dependent on calculated col b, is defined within the same .mutate() call:

(
    df
    .mutate(
        b=df.a + 1,
        # n.b. reference to b below, but _ is not designed for this; it refers to the output of the previous step instead
        c=_.b + 2,
    )
)

Instead, multiple calls to mutate are needed, as below:

(
    df
    .mutate(b = df.a + 1)
    .mutate(c = _.b + 2)
)

Obviously this is not too painful in this example, but once you have many columns to add with a few dependencies between them, you need a lot of calls to mutate and things start to get messy.
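One way to cut the repetition today is a small helper that applies each keyword as its own .mutate() call, in the order written. This is only a sketch: `mutate_chain` is a hypothetical name, not part of ibis, and it assumes nothing about the table beyond a `.mutate(**kwargs)` method that returns a new table.

```python
from typing import Any


def mutate_chain(table: Any, **columns: Any) -> Any:
    """Apply each keyword as a separate .mutate() call, in the order given.

    Hypothetical helper, not an ibis API. Keyword order is preserved by
    Python, so later columns can refer to earlier ones via a deferred
    expression that is resolved against the previous step's output.
    """
    for name, expr in columns.items():
        table = table.mutate(**{name: expr})
    return table
```

With ibis this would be called as `mutate_chain(df, b=_.a + 1, c=_.b + 2)`, which should build the same expression as the two chained calls above, since each `_` is resolved against the result of the previous step.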

What is the motivation behind your request?

No response

Describe the solution you'd like

Some way of referencing new fields within same call to .mutate() would be nice

What version of ibis are you running?

9.1.0

What backend(s) are you using, if any?

Postgres

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

JonAnCla avatar Jul 04 '24 12:07 JonAnCla

A problem arises during renaming: how should something like this behave?

df.mutate(b=df.b + 1, c=_.b + 2)

Which b would that use?

cpcloud avatar Jul 04 '24 13:07 cpcloud

Thanks for the reply :)

Agree that with the current meaning of "_", that would be a problem

Two ideas:

  1. throw an exception if a user tries to both alter an existing field and use the altered field within a single mutate. I think this might be the better option because of (i) lower risk of a reader getting confused about what was intended and (ii) no need for another version of _ that disambiguates previous vs. current mutate (see next idea)
  2. have a new version of "_" i.e. a different variable that can be used to refer to variables within the same mutate call. e.g. let's call it F: df.mutate(b=df.b + 1, c=F.b + 2)

JonAnCla avatar Jul 04 '24 13:07 JonAnCla

> throw an exception if a user tries to both alter an existing field and use the same altered field within a single mutate

Right now, the behavior of this is unambiguous: _ always refers to the relation to the left of the last . the user wrote, so something like mutate(b=_.b + 1, c=_.b) is valid.

It would likely break a bunch of existing code to start raising an exception when someone renames with the same name and uses that field in other computed columns in the same mutate call.
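To make the current semantics concrete, here is a toy model in plain Python. It is not how ibis implements `_`; the `Deferred` class and the dict-based `mutate` function are stand-ins invented for illustration. The point it shows is that every deferred expression is resolved against the input, so c sees the old b in a single call and the new b only when the calls are chained.

```python
class Deferred:
    """Toy stand-in for a deferred expression: a function of the input row."""

    def __init__(self, fn=lambda row: row):
        self._fn = fn

    def __getattr__(self, name):
        # _.b  ->  look up column "b" on whatever row is supplied later
        return Deferred(lambda row: self._fn(row)[name])

    def __add__(self, other):
        return Deferred(lambda row: self._fn(row) + other)

    def resolve(self, row):
        return self._fn(row)


_ = Deferred()


def mutate(row, **columns):
    # Every expression is resolved against the *input* row, so columns
    # defined in the same call cannot see each other's new values.
    new_values = {name: expr.resolve(row) for name, expr in columns.items()}
    return {**row, **new_values}


row = {"a": 1, "b": 10}
same_call = mutate(row, b=_.b + 1, c=_.b)        # c reads the old b
chained = mutate(mutate(row, b=_.b + 1), c=_.b)  # c reads the new b
```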

cpcloud avatar Jul 04 '24 13:07 cpcloud

Thanks, I see your point.

I guess if the idea behind this issue seemed worth pursuing, options could be:

  1. add a deprecation warning where people currently use code like mutate(b=_.b + 1, c=_.b), then eventually remove that behavior and throw instead
  2. keep the current approach and allow mutate(b=_.b + 1, c=_.b), but issue a warning that altering b and using previous b in same step is ambiguous/not recommended
  3. keep current approach but issue no message and let people work it out for themselves :)
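Option 2 could be sketched roughly as follows. This is a hypothetical check, not something ibis does: `check_ambiguous_reads` is an invented name, and the caller is assumed to supply, for each new column, the set of column names its expression reads.

```python
import warnings


def check_ambiguous_reads(existing, reads_by_new_column):
    """Warn when a column reads a name that a sibling column overwrites.

    existing: names of columns already on the table.
    reads_by_new_column: maps each assigned name to the set of column
        names its expression reads (supplied by the caller in this sketch).
    Returns the list of (column, overwritten_name) pairs that were flagged.
    """
    overwritten = set(reads_by_new_column) & set(existing)
    flagged = []
    for name, reads in reads_by_new_column.items():
        # b = _.b + 1 reading its own old value is not flagged; only
        # *other* columns reading an overwritten name are.
        for clash in sorted((set(reads) & overwritten) - {name}):
            flagged.append((name, clash))
            warnings.warn(
                f"{name!r} reads {clash!r}, which is also reassigned in this "
                "call; the previous value is used",
                UserWarning,
                stacklevel=2,
            )
    return flagged
```

For mutate(b=_.b + 1, c=_.b) on a table with columns a and b, the input would be `check_ambiguous_reads({"a", "b"}, {"b": {"b"}, "c": {"b"}})`, which flags only c.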

If you consider none of these viable then no problem, I hadn't foreseen this issue

Thanks!

JonAnCla avatar Jul 04 '24 14:07 JonAnCla

I think option 4 might be to keep the current approach but document this behavior in our tutorials that use _ and also in the API docs (probably in select and friends), so that it's more obvious that the lack of support for reuse-in-the-same-mutate is intentional.

Thanks for engaging! Appreciate the feedback.

cpcloud avatar Jul 04 '24 14:07 cpcloud

Sounds great to me, thanks for taking a look :)

JonAnCla avatar Jul 04 '24 15:07 JonAnCla

Hey @hottwaj! Seems like @cpcloud's recommended path forward makes sense. Would you like to raise a PR to make the docs clearer on this? Otherwise, we can add an issue to our backlog.

deepyaman avatar Jul 05 '24 17:07 deepyaman