refactor(ir): remove DestructValue expressions
Defer the DestructValue expression's functionality to ops.StructField.
BREAKING CHANGE: removed ir.DestructValue, ir.DestructScalar and ir.DestructColumn
The DestructValue expression is a pandas-execution-level optimization that leaked into the expression API. In addition, this expression type is the only one without a corresponding operation, hence the workaround code in analysis.py.
Due to the lack of a corresponding operation node, #4135 requires this patch. The drawback of this PR is that a rather specific pandas execution optimization (the reuse of UDF results returning a struct type) goes away.
This optimization should be re-implemented using a different approach: either by refactoring the pandas backend to execute the nodes in topological order and track the dependencies between nodes (I experimented with that here), or by passing a mutable cache object along the execute_node calls, since the Scope objects cannot propagate upward in the execution tree.
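The mutable-cache idea can be sketched in plain Python. This is an illustration only, not the actual ibis pandas backend API: `Node`, `execute_node`, and the field names here are toy stand-ins. The point is that passing one shared dict down the recursive top-down execution lets a struct-returning UDF be evaluated once, even when several field projections reference it.

```python
from dataclasses import dataclass

calls = {"count": 0}

@dataclass(frozen=True)
class Node:
    name: str
    deps: tuple = ()

def execute_node(node, cache):
    # Reuse a previously computed result for this exact node, if any.
    if node in cache:
        return cache[node]
    args = [execute_node(d, cache) for d in node.deps]
    if node.name == "udf":
        calls["count"] += 1           # count expensive struct-UDF evaluations
        result = {"x": 1, "y": 2}     # pretend this is the struct result
    else:
        result = args[0][node.name]   # a field access on the struct
    cache[node] = result
    return result

udf = Node("udf")
cache = {}
x = execute_node(Node("x", (udf,)), cache)  # evaluates the UDF
y = execute_node(Node("y", (udf,)), cache)  # reuses the cached UDF result
```

Because the cache is keyed on the (hashable) node itself, the second field access finds the struct already computed, so the UDF body runs exactly once.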
Additionally, I'd like to merge this PR to the 4.x.x branch, which is basically the preparation for the intermediate expression removal. Since we don't have a clear consensus there yet, you can also think of this PR (and the 4.x.x branch in general) as part of that effort. I'm trying to encapsulate all the independent changes into single commits so we can easily cherry-pick the ones we agree on.
The problem is we are still catching up on earlier breaking changes and can't even test anything like this yet.
We have found that our internal tests catch a lot more than the ibis tests do.
Conceptually I'm not averse to these changes, but we need some time.
> The problem is we are still catching up on earlier breaking changes and can't even test anything like this yet.
> We have found that our internal tests catch a lot more than the ibis tests do.
Would it be possible to port some of the internal tests to improve upstream's coverage in order to prevent downstream test failures?
cc @icexelloss
> Would it be possible to port some of the internal tests to improve upstream's coverage in order to prevent downstream test failures?
Most of the tests that can be pushed upstream have already been pushed, for example test_elementwise_udf_destruct_exact_once. The other tests depend on internal logic and are hard to push upstream.
With regard to this PR: in principle this change seems fine to me, but I'd much prefer not to merge this to master without solving the pandas execution backend issue w.r.t. duplicate execution. Since we use the pandas backend in production internally, merging this to master without solving that would mean a pretty big performance regression, and we could no longer sync with ibis master.
A general comment, as I have mentioned in some other PRs as well: we struggle to keep up with the frequency of large refactorings in ibis. We continuously have people working on updating internal code for these refactors, and every time we do, it seems things will just break again in another few weeks (we still have 15-20 follow-up issues from the previous upgrade that are not resolved). It's quite painful and hard to sustain. I agree most of these changes are good, but I hope they can come at a slower pace.
@icexelloss Thanks for chiming in. Is there any way we could get some information about the things that are breaking?
If there are no BREAKING CHANGE commits between your current version and the next commit you track, and things still break when using public ibis APIs, then we'd really like to know about it so we don't accidentally break something and release it!
@cpcloud Thanks for chiming in.
I think we are in the tricky situation that a lot of our code is not built just against the ibis public API (and, to be fair, there is no way to build our system using only the ibis public API); we use the ibis expr/node level to build our code (e.g., adding new exprs/nodes, creating subclasses of existing nodes, implementing functions to traverse and rewrite expression trees). Ironically, breaking public APIs are actually easier for us to fix, and expr/node-level changes are harder.
The way we know something might be breaking is really to pull the ibis changes and test our internals (for example, when the change to make nodes immutable was merged, we pulled it in and fixed various subclasses that had unhashable args, i.e., list and dict), but it is not easy even for us to know what might break until we actually upgrade.
That being said, I don't have a good idea of how best to deal with the situation, since we are building heavily on the expr/node layer, which is not stable.
One thing I can think of is that we could try to push some of the stuff we have upstream so that at least those parts are more protected from future refactors, e.g.:
- https://github.com/ibis-project/ibis/issues/3987
- Tree rewrite libraries (we call these rule libraries; each takes an ibis tree, performs a transformation, and outputs an ibis tree)
- Some of the custom nodes, e.g., a TableNode backed by a pandas DataFrame or a Spark DataFrame (we call these external tables)
@icexelloss It sounds like you are tracking master or working off of a commit that isn't part of a release.
That has the potential to cause frequent breakage that no approach other than "things will never break" can protect against. The current policy is to follow semantic versioning, which means breaking changes happen between major releases and every other release doesn't break anything. Tracking master may or may not work in that scenario, depending on what the next release will be.
Considering that the time between 2.x and 3.x was about 6 months, what about tracking released versions instead and selectively applying patches on top of the release internally as needed?
Then the work to upgrade can be batched to happen after a major release, which will generally be much less frequent than non-major releases.
> With regard to this PR: in principle this change seems fine to me, but I'd much prefer not to merge this to master without solving the pandas execution backend issue w.r.t. duplicate execution. Since we use the pandas backend in production internally, merging this to master without solving that would mean a pretty big performance regression, and we could no longer sync with ibis master.
I understand the importance of this problem, so let's try to figure out a proper way to re-implement it. I can see three approaches here, though it's challenging due to the top-down nature of the execution:
- Introduce a new mutable cache variable, pass it to the `execute_node` functions, and handle the reuse of struct UDFs manually, though this requires some lifetime handling.
- A more generic version of the previous step, where a first traversal of the tree would create a result store tracking which nodes require which nodes' results, so we could immediately free up results that are no longer required. This could actually replace the `Scope` object while keeping the top-to-bottom execution flow, but I'm not sure how to handle the timecontext in this case.
- Rewrite the execution model to follow topological order; this way the nodes' dependencies get calculated once (similarly to the previous approach), but in a bottom-up execution flow. This would further improve (I assume significantly) the execution speed, since it would reuse all the intermediate results, not just the struct types. I have an incomplete prototype for it; a couple of pointers:
I stopped working on my prototype since it requires more advanced rewrites for aggregations, where it would be preferable to use a generic tree rewrite tool (like matchpy, or a logic programming approach like kanren). Since you mentioned that as one of the features you have, I'd be especially curious to see your implementation.
Any other ideas for how we could re-add support for it?
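For the third option, topological-order execution with result freeing can be sketched as follows. This is illustrative only, not the actual prototype: `compute` is a stand-in for per-node execution kernels, and the sketch assumes Python 3.9+ for `graphlib`. A consumer count per node lets intermediate results be dropped as soon as their last consumer has used them.

```python
from collections import Counter
from graphlib import TopologicalSorter

def execute_topological(graph, compute):
    """graph maps node -> tuple of dependency nodes;
    compute(node, dep_results) produces the node's value."""
    # How many downstream nodes still need each node's result.
    consumers = Counter(d for deps in graph.values() for d in deps)
    results, last = {}, None
    for node in TopologicalSorter(graph).static_order():
        # Dependencies are guaranteed to be computed (and not yet freed).
        results[node] = compute(node, [results[d] for d in graph[node]])
        last = node
        for d in graph[node]:
            consumers[d] -= 1
            if consumers[d] == 0:
                del results[d]   # free an intermediate no longer needed
    return results[last]

# Toy graph: "sum_" reuses "x" directly and via "double".
graph = {"x": (), "double": ("x",), "sum_": ("x", "double")}

def compute(node, deps):
    if node == "x":
        return 5
    if node == "double":
        return deps[0] * 2
    return sum(deps)   # "sum_": 5 + 10

out = execute_topological(graph, compute)
```

Each node is computed exactly once, so shared subexpressions (like `x` above) are reused rather than re-executed, which is the speedup the bottom-up flow would buy over the current top-down recursion.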
> I think we are in the tricky situation that a lot of our code is not built just against the ibis public API (and, to be fair, there is no way to build our system using only the ibis public API); we use the ibis expr/node level to build our code (e.g., adding new exprs/nodes, creating subclasses of existing nodes, implementing functions to traverse and rewrite expression trees). Ironically, breaking public APIs are actually easier for us to fix, and expr/node-level changes are harder.
I think it's related to the technical debt lying around the internal parts of ibis; ideally this shouldn't be that hard.
> The way we know something might be breaking is really to pull the ibis changes and test our internals (for example, when the change to make nodes immutable was merged, we pulled it in and fixed various subclasses that had unhashable args, i.e., list and dict), but it is not easy even for us to know what might break until we actually upgrade.
> That being said, I don't have a good idea of how best to deal with the situation, since we are building heavily on the expr/node layer, which is not stable.
I've rewritten most of the backends in https://github.com/ibis-project/ibis/pull/4219 to the new intermediate representation without intermediate expressions, just operation nodes, and it was pretty straightforward. Actually, it was a much more pleasant experience than I expected. @cpcloud has rewritten a couple of backends too; I think he can confirm that porting code to the new representation is way easier than it looks at first glance.
> One thing I can think of is that we could try to push some of the stuff we have upstream so that at least those parts are more protected from future refactors, e.g.:
Actually that would be great!
I didn't see that issue, but it seems reasonable depending on the stability requirements, e.g., having another method like `__hash__` but using md5 instead of the builtin `hash()`.
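The idea floated here can be sketched like this: a content-based digest that is stable across processes, unlike the builtin `hash()`, which is salted per interpreter run for strings. The function name and argument scheme are illustrative, not an actual ibis API.

```python
import hashlib

def stable_token(*args):
    """Deterministic hex digest over, e.g., a node's class name and args.

    Unlike hash(), this produces the same value in every Python process,
    so it is safe to persist or to compare across runs.
    """
    h = hashlib.md5()
    for a in args:
        # repr() is used for simplicity here; a real implementation would
        # need a canonical serialization of node arguments.
        h.update(repr(a).encode("utf-8"))
    return h.hexdigest()

t1 = stable_token("StructField", "col", "x")
t2 = stable_token("StructField", "col", "x")
```

Equal inputs always yield equal 32-character digests, and distinct inputs diverge, which is the property a persistent node fingerprint needs.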
> - Tree rewrite libraries (we call these rule libraries; each takes an ibis tree, performs a transformation, and outputs an ibis tree)
As mentioned above, a generic rule-based tree rewrite solution is something we want to have, so this is something I'm really interested in.
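For context, the "rule library" shape described above (take a tree, apply transformation rules, return a new tree) can be sketched minimally as follows. The `Node` type and the constant-folding rule are illustrative stand-ins, not the internal library being discussed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    op: str
    args: tuple = ()

def rewrite(node, rules):
    # Rewrite children first (bottom-up), then give each rule a chance
    # at the current node; a rule returns a replacement or None.
    if isinstance(node, Node):
        node = Node(node.op, tuple(rewrite(a, rules) for a in node.args))
        for rule in rules:
            replacement = rule(node)
            if replacement is not None:
                return rewrite(replacement, rules)
    return node

def fold_add(node):
    """Example rule: constant-fold addition of two literal nodes."""
    if node.op == "add" and all(
        isinstance(a, Node) and a.op == "lit" for a in node.args
    ):
        return Node("lit", (sum(a.args[0] for a in node.args),))
    return None

tree = Node("add", (
    Node("lit", (1,)),
    Node("add", (Node("lit", (2,)), Node("lit", (3,)))),
))
out = rewrite(tree, [fold_add])
```

Running the rewriter folds the nested additions into a single literal; real rule libraries layer pattern matching and rule ordering on top of this basic fixpoint traversal.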
> - Some of the custom nodes, e.g., a TableNode backed by a pandas DataFrame or a Spark DataFrame (we call these external tables)
Sounds like another useful feature, which seemingly implies multi-backend execution.
Test Results
35 files, 35 suites, 1h 10m 54s :stopwatch:
8 879 tests: 7 085 :heavy_check_mark:, 1 794 :zzz:, 0 :x:
32 532 runs: 25 704 :heavy_check_mark:, 6 828 :zzz:, 0 :x:
Results for commit 08ddab37.
I appreciate the thoughts. Sorry for the delayed reply; I don't have a very good system set up to avoid missing conversations like this :(
Re @cpcloud
> Considering that the time between 2.x and 3.x was about 6 months, what about tracking released versions instead and selectively applying patches on top of the release internally as needed?
I think this is an option and something we did in the past (we were off the master branch / cherry-picking for a couple of months, then took a major effort to sync with the default branch). My concern with doing this, considering the structural difference between 3.x and 4.x, is that all features/bug fixes are likely to conflict when applied to both branches, so we would have to deal with that for a while. The other concern I have is about things like ibis-substrait, which is pretty important for us. I don't know if you plan to maintain ibis-substrait for both 3.x and 4.x, but if you don't plan to maintain it for 3.x and internally we are on ibis 3.x, it means we have to maintain an ibis-substrait 3.x branch too. So the trade-off is not clearly better either way.
Re @kszucs: I can look into pushing some of the stuff upstream. The rule-based rewrite library is somewhat standalone but still needs cleanup before we can push it.
@jreback @icexelloss @cpcloud I forwarded a global cache mapping to the execute_node calls to preserve the caching behaviour for struct UDFs.
I added it only to the elementwise UDFs, since the others are not covered by tests. Another option would be to make it pluggable by passing the cache to pre_execute and post_execute and letting the user decide what to place in the cache.
I think we could go ahead and merge this in order to unblock the upcoming developments, and address possible problems later.
Codecov Report
Merging #4217 (08ddab3) into 4.x.x (448f587) will increase coverage by 7.47%. The diff coverage is 100.00%.
```
@@            Coverage Diff            @@
##            4.x.x    #4217     +/-   ##
==========================================
+ Coverage   85.31%   92.79%   +7.47%
==========================================
  Files         180      180
  Lines       20151    20088      -63
  Branches     2876     2853      -23
==========================================
+ Hits        17192    18640    +1448
+ Misses       2624     1090    -1534
- Partials      335      358      +23
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| ibis/expr/api.py | 97.46% <ø> (ø) | |
| ibis/expr/format.py | 93.66% <ø> (-0.15%) | :arrow_down: |
| ibis/expr/operations/structs.py | 96.42% <ø> (-0.13%) | :arrow_down: |
| ibis/backends/dask/core.py | 94.87% <100.00%> (+0.13%) | :arrow_up: |
| ibis/backends/dask/execution/structs.py | 100.00% <100.00%> (ø) | |
| ibis/backends/dask/execution/util.py | 87.50% <100.00%> (-1.24%) | :arrow_down: |
| ibis/backends/dask/udf.py | 97.39% <100.00%> (+0.90%) | :arrow_up: |
| ibis/backends/pandas/core.py | 96.39% <100.00%> (+0.06%) | :arrow_up: |
| ibis/backends/pandas/execution/structs.py | 100.00% <100.00%> (ø) | |
| ibis/backends/pandas/execution/util.py | 92.30% <100.00%> (+8.70%) | :arrow_up: |
| ... and 20 more | | |
@jreback @icexelloss Please review.
The API is now preserved, and the original test verifying that the UDF is executed exactly once passes.
We'd like to merge this PR this week.
will do soon