ibis
ibis copied to clipboard
refactor: destructure columns into struct fields inside mutate API
Currently, when creating a Selection node, there is branching logic on whether any column in the selection is an assignment that overwrites an existing column in the table, at which point the node stores a list of all column exprs in the table, instead of just the table and the column assignments. There is additional special logic required here to:
- catch the case where there is an overwrite inside a DestructColumn, and
- correctly order the result columns after execution, since a DestructColumn packages multiple columns together inside a single expr, and we specifically want to ensure that new columns appear at the end of the materialized output.
A better solution is to destructure the columns inside the mutate API, so for example a DestructColumn containing three columns will be destructured into three ColumnExprs with StructField ops at expression creation time. This will make it easier to ensure correct column ordering.
@jreback I'll be removing the milestone here, please let me know if this is a blocker for Ibis 2.0, and I'll set it back.
@emilyreff7 it would be great if you can provide a code example that shows the current behaviour, and what you'd like the proposed behaviour to be instead.
@emilyreff7 Is there a behavior change here, or is this an internal refactoring with no user facing API changes?
This is the original PR that motivated this follow-up issue, and the overview section there should give sufficient context.
The tldr; is the above PR fixed a bug whereby overwriting an existing column via a multi-column UDF used to fail. The limitation of that PR was that the ordering of the output columns may not be entirely correct, and I give an example of that in the overview.
IIRC, at the time I played around with doing this destructuring, there were some unforeseen performance implications in the pandas backend, possibly due to the way the pandas backend handles scope (e.g. it seemed that the different struct fields didn't share the same scope, so we ended up recomputing the entire UDF for each struct field), so we left the PR as the baseline implementation that merely fixed the bug, with this issue as a follow-up.
It would be a user-facing change in the sense that correct column ordering could be achieved in this particular use case.
This was addressed in the recent merge of 4.x.x to master #4512.