ibis
ibis copied to clipboard
refactor: remove intermediate expressions
Both the core and sqlite tests pass now, result which is a pretty good baseline to work with.
Hello @kszucs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file
ibis/backends/base/sql/compiler/translator.py:
Line 379:40: F821 undefined name 'to'
- In the file
ibis/backends/base/sql/registry/string.py:
Line 36:35: F821 undefined name 'start'
- In the file
ibis/expr/lineage.py:
Line 136:1: F723 syntax error in type comment 'type = ir.Expr,' Line 146:80: E501 line too long (81 > 79 characters)
- In the file
ibis/expr/operations/core.py:
- In the file
ibis/expr/operations/generic.py:
Line 21:1: F401 'ibis.expr.operations.core.Node' imported but unused
- In the file
ibis/expr/operations/relations.py:
Line 138:80: E501 line too long (87 > 79 characters) Line 153:80: E501 line too long (83 > 79 characters) Line 157:80: E501 line too long (82 > 79 characters) Line 158:80: E501 line too long (82 > 79 characters)
- In the file
ibis/expr/rules.py:
- In the file
ibis/expr/types/core.py:
Line 19:1: F401 'ibis.common.grounds.Base' imported but unused
Comment last updated at 2022-06-22 17:50:47 UTC
Unit Test Results
35 files 35 suites 1h 3m 47s :stopwatch: 8 597 tests 4 837 :heavy_check_mark: 1 654 :zzz: 1 863 :x: 243 :fire: 31 523 runs 14 208 :heavy_check_mark: 6 285 :zzz: 10 769 :x: 261 :fire:
For more details on these failures and errors, see this check.
Results for commit 5800bb3d.
:recycle: This comment has been updated with latest results.
Codecov Report
Merging #4135 (cff3d1e) into 4.x.x (bd75053) will decrease coverage by
8.11%. The diff coverage is87.70%.
:exclamation: Current head cff3d1e differs from pull request most recent head 80e110e. Consider uploading reports for the commit 80e110e to get more accurate results
@@ Coverage Diff @@
## 4.x.x #4135 +/- ##
==========================================
- Coverage 93.03% 84.92% -8.12%
==========================================
Files 158 157 -1
Lines 18346 17702 -644
Branches 2670 2630 -40
==========================================
- Hits 17069 15033 -2036
- Misses 960 2339 +1379
- Partials 317 330 +13
| Impacted Files | Coverage Δ | |
|---|---|---|
| ibis/backends/duckdb/datatypes.py | 60.41% <ø> (-39.59%) |
:arrow_down: |
| ibis/backends/impala/client.py | 0.00% <0.00%> (-87.83%) |
:arrow_down: |
| ibis/backends/impala/compiler.py | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ibis/backends/impala/pandas_interop.py | 0.00% <0.00%> (-91.31%) |
:arrow_down: |
| ibis/backends/mysql/registry.py | 0.00% <0.00%> (-96.08%) |
:arrow_down: |
| ibis/config.py | 85.71% <ø> (+9.00%) |
:arrow_up: |
| ibis/backends/base/sql/registry/window.py | 27.27% <25.92%> (-61.83%) |
:arrow_down: |
| ibis/backends/duckdb/registry.py | 27.90% <47.36%> (-64.32%) |
:arrow_down: |
| ibis/backends/base/sql/registry/aggregate.py | 77.27% <55.55%> (-22.73%) |
:arrow_down: |
| ibis/udf/vectorized.py | 81.17% <55.55%> (-12.50%) |
:arrow_down: |
| ... and 122 more |
stumbled upon this..
I wonder if the main idea to use Node as arguments for Node (instead of using Expr)? This looks like a pretty aggressive change and I wonder what's the main driving reason for this?
@icexelloss Thanks for chiming in!
There are a few reasons why we're undertaking this change.
We want to tackle big ticket items all of which are made unreasonably difficult by the current indirection between Exprs and Nodes.
Here's an example that's been top of mind for us recently.
Replacing the functionality in analysis.py with a rule based rewrite system
I prototyped the beginnings of such as system here based on pattern matching using matchpy.
Bugs in analysis.py have the worst combination of effects: they tend to propagate to many parts of ibis unpredictably and are difficult to debug. The code is hard to reason about because of the way its current pattern matching is implemented. The rules are encoded ad-hoc as complex conditionals and the actions taken when the conditionals are True are difficult to follow.
Here's a subset of issues and pull requests directly related to analysis.py:
- #1045
- #1387
- #2427
- #2830
- #3341
- #3357
- #3577
- #4005
To implement this pattern matching based system, we need to canonicalize the internal representation to Node, and avoid jumping back and forth internally between Expr and Node.
Here a few outcomes that we're targeting for the end state of these refactors:
analysis.pyis completely subsumed by the new system- End-user APIs require no changes (APIs based on use of internals like
Nodesubclasses will unfortunately break) - Generated SQL code is nearly as good as it would be if written by hand. For example, no unnecessary projections, fused and pushed-down predicates, column/projection pruning. In some cases this will help backends better optimize plans resulting in both more readable target code and faster execution.
- Conveniences like #2703 can be encoded by maintainers with relative ease as long as a pattern can be written to describe the transformation.
Thanks @cpcloud. I agree that analysis.py is bad and happy to see some effort to replace it. However, I do have some concern and questions about this.
To implement this pattern matching based system, we need to canonicalize the internal representation to Node, and avoid jumping back and forth internally between Expr and Node
Other than this, are there other reason that the Node/Expr structure are bad (I understand the annoyance of having to jump back and forth in various places, which is annoying but not a huge deal IMHO). I feel like there is a reason that Node/Expr are structured this way in the first place and concerned about that fitting with matchpy shadows the reason/benefit that Node/Expr are structured this way in the first place.
I also wonder how bad it is to implement the matchpy based system without this change since I am not familiar with matchpy myself. Or is it impossible without this?
I am supportive of improving the project health long term since our critical infrastructure depend on it, but honestly we spent quite a bit of time on dealing with ibis refactors this year and this one seem to be even much more breaking so just wonder (1) if this is the right thing to do and (2) if so, whether there is a less aggressive way to approach this.
I feel like there is a reason that Node/Expr are structured this way in the first place and concerned about that fitting with matchpy shadows the reason/benefit that Node/Expr are structured this way in the first place.
The reason the separation exists is to keep the API and internal IR separate, which remains true in this PR. Exprs aren't going away, they're going away internally. After this PR we're only dealing with ops inside of ibis (there are some cases where that isn't true in the current PR, and we'll need to address that before merging this).
I also wonder how bad it is to implement the matchpy based system without this change since I am not familiar with matchpy myself. Or is it impossible without this?
It's not related to matchpy per se. We're considering alternatives like kanren. Ultimately we need a single set of types for some pattern matching library to operate on to keep the integration low on hacks.
we spent quite a bit of time on dealing with ibis refactors this year
Sorry to hear that. We're happy to help with that. For example, if there was a breaking change to a public API in a non-major release, we'd like to know about it. I appreciate the PRs and issues y'all have opened too, keep that coming :)
if so, whether there is a less aggressive way to approach this.
Hard to say, but no other alternatives are proposed and there's a finite amount of time we can spend here.
I think the requirements for an alternative are a path towards the following:
- ability to remove or fully replace analysis.py
- no expression API breakage
- easily-extensible rewrite system for relational algebra rewrites
- predicate pushdown with less code and ideally no API regressions
- column pruning with less code and ideally no API regressions
We're not planning to release this until 4.0.0, and there'll be a 3.1 release before that (in the next couple weeks) so there's some time to try things out and give feedback before we merge this.
@cpcloud this is a really aggressive change and will likey blow up the world. Can you explain the rationale
@cpcloud this is a really aggressive change and will likey blow up the world. Can you explain the rationale
See all the text above. Happy to answer any specific questions or concerns.
i read the text above and it doesn't answer the question of why here
in any event please don't merge these u til we have a discussion
It'll be a bit before we merge it.
Are there any specific questions I can answer?
well see @icexelloss yhis seems like a huge amount of churn for a very small gain
so am -1 at this point
I feel like there is a reason that
Node/Exprare structured this way in the first place and concerned about that fitting withmatchpyshadows the reason/benefit thatNode/Exprare structured this way in the first place.
The main reason to separate them is to provide a typed user facing API and syntax sugar. We need to bound different API methods for differently typed value operations.
Historically this difference hasn't been that clear since Nodes/Operations were used as factories to instantiate Expression objects. At that time Expression classes held op, name and datatype properties, so they acted as an extension on top of operation instances. The datatype information was directly available from the underlying operation and name was the only data attribute held by Expressions which wasn't available directly from the operation. Since https://github.com/ibis-project/ibis/pull/3276 we managed move the name property to the operations by introducing an [Value]Alias node, meaning that expressions don't store anything but the underlying operation. This means that Expressions hold just as much information as the operation they wrap which makes them a redundant abstraction - in regard of internal representation.
The idea here is to unify the ibis internal representation containing only ops.Node instances and wrap them with a single Expr for user consumption (by simply calling node.to_expr()).
I also wonder how bad it is to implement the
matchpybased system without this change since I am not familiar withmatchpymyself. Or is it impossible without this?
Matchpy requires an object hierarchy of the same type, so I din't think it's possible without this change. I have a matchpy prototype implementation, but I wasn't able to hook it up while having intermediate expressions.
Other than this, are there other reason that the
Node/Exprstructure are bad (I understand the annoyance of having to jump back and forth in various places, which is annoying but not a huge deal IMHO).
I disagree, the duality of the type hierarchy introduces a serious confusion resulting leaky abstractions and various workarounds specifically applicable for some specific use cases (as an example see @cpcloud's collection of bugs above).
I think it's crucial for ibis's long term success to maintain a simple to understand and strict internal representation with generic utilities (for traversing, matching and replacing nodes in the hierarchy). Otherwise the technical dept we have in ibis is going to block future developments.
I am supportive of improving the project health long term since our critical infrastructure depend on it, but honestly we spent quite a bit of time on dealing with ibis refactors this year and this one seem to be even much more breaking so just wonder (1) if this is the right thing to do and (2) if so, whether there is a less aggressive way to approach this.
The reason of all those refactors was to reach a point where we can actually enable this refactor. Lately we merged several PRs aiming to simplify the analysis code in order to understand what happens in those routines at the first place, so this is a "less aggressive" way. I cannot really think of a better approach, but I'm totally open to suggestions.
in any event please don't merge these u til we have a discussion
Sure, I don't have any intention to merge this change until we have a clear consensus on how to (or not to) proceed.
Test Results
35 files 35 suites 58m 18s :stopwatch: 8 301 tests 6 098 :heavy_check_mark: 1 648 :zzz: 550 :x: 5 :fire: 29 155 runs 18 522 :heavy_check_mark: 6 203 :zzz: 4 390 :x: 40 :fire:
For more details on these failures and errors, see this check.
Results for commit e3b20392.
:recycle: This comment has been updated with latest results.
closing in favor of #4219