Both the core and sqlite tests pass now, result which is a pretty good baseline to work with.

Jun 22 '22 16:06 kszucs

Hello @kszucs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file ibis/backends/base/sql/compiler/translator.py:

Line 379:40: F821 undefined name 'to'

In the file ibis/backends/base/sql/registry/string.py:

Line 36:35: F821 undefined name 'start'

In the file ibis/expr/lineage.py:

Line 136:1: F723 syntax error in type comment 'type = ir.Expr,' Line 146:80: E501 line too long (81 > 79 characters)

In the file ibis/expr/operations/core.py:

Line 8:1: F401 'ibis.expr.types as ir' imported but unused

In the file ibis/expr/operations/generic.py:

Line 21:1: F401 'ibis.expr.operations.core.Node' imported but unused

In the file ibis/expr/operations/relations.py:

Line 138:80: E501 line too long (87 > 79 characters) Line 153:80: E501 line too long (83 > 79 characters) Line 157:80: E501 line too long (82 > 79 characters) Line 158:80: E501 line too long (82 > 79 characters)

In the file ibis/expr/rules.py:

Line 3:1: F401 'typing.Any' imported but unused

In the file ibis/expr/types/core.py:

Line 19:1: F401 'ibis.common.grounds.Base' imported but unused

Comment last updated at 2022-06-22 17:50:47 UTC

Jun 22 '22 16:06 pep8speaks

Unit Test Results

      35 files       35 suites 1h 3m 47s :stopwatch:   8 597 tests   4 837 :heavy_check_mark: 1 654 :zzz:   1 863 :x: 243 :fire: 31 523 runs 14 208 :heavy_check_mark: 6 285 :zzz: 10 769 :x: 261 :fire:

For more details on these failures and errors, see this check.

Results for commit 5800bb3d.

:recycle: This comment has been updated with latest results.

Jun 22 '22 17:06 github-actions[bot]

Codecov Report

Merging #4135 (cff3d1e) into 4.x.x (bd75053) will decrease coverage by 8.11%. The diff coverage is 87.70%.

:exclamation: Current head cff3d1e differs from pull request most recent head 80e110e. Consider uploading reports for the commit 80e110e to get more accurate results

@@            Coverage Diff             @@
##            4.x.x    #4135      +/-   ##
==========================================
- Coverage   93.03%   84.92%   -8.12%     
==========================================
  Files         158      157       -1     
  Lines       18346    17702     -644     
  Branches     2670     2630      -40     
==========================================
- Hits        17069    15033    -2036     
- Misses        960     2339    +1379     
- Partials      317      330      +13

Impacted Files	Coverage Δ
ibis/backends/duckdb/datatypes.py	`60.41% <ø> (-39.59%)`	:arrow_down:
ibis/backends/impala/client.py	`0.00% <0.00%> (-87.83%)`	:arrow_down:
ibis/backends/impala/compiler.py	`0.00% <0.00%> (-100.00%)`	:arrow_down:
ibis/backends/impala/pandas_interop.py	`0.00% <0.00%> (-91.31%)`	:arrow_down:
ibis/backends/mysql/registry.py	`0.00% <0.00%> (-96.08%)`	:arrow_down:
ibis/config.py	`85.71% <ø> (+9.00%)`	:arrow_up:
ibis/backends/base/sql/registry/window.py	`27.27% <25.92%> (-61.83%)`	:arrow_down:
ibis/backends/duckdb/registry.py	`27.90% <47.36%> (-64.32%)`	:arrow_down:
ibis/backends/base/sql/registry/aggregate.py	`77.27% <55.55%> (-22.73%)`	:arrow_down:
ibis/udf/vectorized.py	`81.17% <55.55%> (-12.50%)`	:arrow_down:
... and 122 more

Jun 27 '22 21:06 codecov[bot]

stumbled upon this..

I wonder if the main idea to use Node as arguments for Node (instead of using Expr)? This looks like a pretty aggressive change and I wonder what's the main driving reason for this?

Jun 29 '22 21:06 icexelloss

@icexelloss Thanks for chiming in!

There are a few reasons why we're undertaking this change.

We want to tackle big ticket items all of which are made unreasonably difficult by the current indirection between Exprs and Nodes.

Here's an example that's been top of mind for us recently.

Replacing the functionality in `analysis.py` with a rule based rewrite system

I prototyped the beginnings of such as system here based on pattern matching using matchpy.

Bugs in analysis.py have the worst combination of effects: they tend to propagate to many parts of ibis unpredictably and are difficult to debug. The code is hard to reason about because of the way its current pattern matching is implemented. The rules are encoded ad-hoc as complex conditionals and the actions taken when the conditionals are True are difficult to follow.

Here's a subset of issues and pull requests directly related to analysis.py:

#1045
#1387
#2427
#2830
#3341
#3357
#3577
#4005

To implement this pattern matching based system, we need to canonicalize the internal representation to Node, and avoid jumping back and forth internally between Expr and Node.

Here a few outcomes that we're targeting for the end state of these refactors:

analysis.py is completely subsumed by the new system
End-user APIs require no changes (APIs based on use of internals like Node subclasses will unfortunately break)
Generated SQL code is nearly as good as it would be if written by hand. For example, no unnecessary projections, fused and pushed-down predicates, column/projection pruning. In some cases this will help backends better optimize plans resulting in both more readable target code and faster execution.
Conveniences like #2703 can be encoded by maintainers with relative ease as long as a pattern can be written to describe the transformation.

Jun 30 '22 13:06 cpcloud

Thanks @cpcloud. I agree that analysis.py is bad and happy to see some effort to replace it. However, I do have some concern and questions about this.

To implement this pattern matching based system, we need to canonicalize the internal representation to Node, and avoid jumping back and forth internally between Expr and Node

Other than this, are there other reason that the Node/Expr structure are bad (I understand the annoyance of having to jump back and forth in various places, which is annoying but not a huge deal IMHO). I feel like there is a reason that Node/Expr are structured this way in the first place and concerned about that fitting with matchpy shadows the reason/benefit that Node/Expr are structured this way in the first place.

I also wonder how bad it is to implement the matchpy based system without this change since I am not familiar with matchpy myself. Or is it impossible without this?

I am supportive of improving the project health long term since our critical infrastructure depend on it, but honestly we spent quite a bit of time on dealing with ibis refactors this year and this one seem to be even much more breaking so just wonder (1) if this is the right thing to do and (2) if so, whether there is a less aggressive way to approach this.

Jun 30 '22 15:06 icexelloss

I feel like there is a reason that Node/Expr are structured this way in the first place and concerned about that fitting with matchpy shadows the reason/benefit that Node/Expr are structured this way in the first place.

The reason the separation exists is to keep the API and internal IR separate, which remains true in this PR. Exprs aren't going away, they're going away internally. After this PR we're only dealing with ops inside of ibis (there are some cases where that isn't true in the current PR, and we'll need to address that before merging this).

I also wonder how bad it is to implement the matchpy based system without this change since I am not familiar with matchpy myself. Or is it impossible without this?

It's not related to matchpy per se. We're considering alternatives like kanren. Ultimately we need a single set of types for some pattern matching library to operate on to keep the integration low on hacks.

we spent quite a bit of time on dealing with ibis refactors this year

Sorry to hear that. We're happy to help with that. For example, if there was a breaking change to a public API in a non-major release, we'd like to know about it. I appreciate the PRs and issues y'all have opened too, keep that coming :)

if so, whether there is a less aggressive way to approach this.

Hard to say, but no other alternatives are proposed and there's a finite amount of time we can spend here.

I think the requirements for an alternative are a path towards the following:

ability to remove or fully replace analysis.py
no expression API breakage
easily-extensible rewrite system for relational algebra rewrites
predicate pushdown with less code and ideally no API regressions
column pruning with less code and ideally no API regressions

We're not planning to release this until 4.0.0, and there'll be a 3.1 release before that (in the next couple weeks) so there's some time to try things out and give feedback before we merge this.

Jun 30 '22 16:06 cpcloud

@cpcloud this is a really aggressive change and will likey blow up the world. Can you explain the rationale

Jul 01 '22 13:07 jreback

@cpcloud this is a really aggressive change and will likey blow up the world. Can you explain the rationale

See all the text above. Happy to answer any specific questions or concerns.

Jul 01 '22 13:07 cpcloud

i read the text above and it doesn't answer the question of why here

in any event please don't merge these u til we have a discussion

Jul 01 '22 22:07 jreback

It'll be a bit before we merge it.

Are there any specific questions I can answer?

Jul 01 '22 23:07 cpcloud

well see @icexelloss yhis seems like a huge amount of churn for a very small gain

so am -1 at this point

Jul 01 '22 23:07 jreback

I feel like there is a reason that Node/Expr are structured this way in the first place and concerned about that fitting with matchpy shadows the reason/benefit that Node/Expr are structured this way in the first place.

The main reason to separate them is to provide a typed user facing API and syntax sugar. We need to bound different API methods for differently typed value operations.

Historically this difference hasn't been that clear since Nodes/Operations were used as factories to instantiate Expression objects. At that time Expression classes held op, name and datatype properties, so they acted as an extension on top of operation instances. The datatype information was directly available from the underlying operation and name was the only data attribute held by Expressions which wasn't available directly from the operation. Since https://github.com/ibis-project/ibis/pull/3276 we managed move the name property to the operations by introducing an [Value]Alias node, meaning that expressions don't store anything but the underlying operation. This means that Expressions hold just as much information as the operation they wrap which makes them a redundant abstraction - in regard of internal representation.

The idea here is to unify the ibis internal representation containing only ops.Node instances and wrap them with a single Expr for user consumption (by simply calling node.to_expr()).

I also wonder how bad it is to implement the matchpy based system without this change since I am not familiar with matchpy myself. Or is it impossible without this?

Matchpy requires an object hierarchy of the same type, so I din't think it's possible without this change. I have a matchpy prototype implementation, but I wasn't able to hook it up while having intermediate expressions.

Other than this, are there other reason that the Node/Expr structure are bad (I understand the annoyance of having to jump back and forth in various places, which is annoying but not a huge deal IMHO).

I disagree, the duality of the type hierarchy introduces a serious confusion resulting leaky abstractions and various workarounds specifically applicable for some specific use cases (as an example see @cpcloud's collection of bugs above).

I think it's crucial for ibis's long term success to maintain a simple to understand and strict internal representation with generic utilities (for traversing, matching and replacing nodes in the hierarchy). Otherwise the technical dept we have in ibis is going to block future developments.

I am supportive of improving the project health long term since our critical infrastructure depend on it, but honestly we spent quite a bit of time on dealing with ibis refactors this year and this one seem to be even much more breaking so just wonder (1) if this is the right thing to do and (2) if so, whether there is a less aggressive way to approach this.

The reason of all those refactors was to reach a point where we can actually enable this refactor. Lately we merged several PRs aiming to simplify the analysis code in order to understand what happens in those routines at the first place, so this is a "less aggressive" way. I cannot really think of a better approach, but I'm totally open to suggestions.

in any event please don't merge these u til we have a discussion

Sure, I don't have any intention to merge this change until we have a clear consensus on how to (or not to) proceed.

Jul 04 '22 12:07 kszucs

Test Results

      35 files       35 suites 58m 18s :stopwatch:   8 301 tests   6 098 :heavy_check_mark: 1 648 :zzz:   550 :x:   5 :fire: 29 155 runs 18 522 :heavy_check_mark: 6 203 :zzz: 4 390 :x: 40 :fire:

For more details on these failures and errors, see this check.

Results for commit e3b20392.

:recycle: This comment has been updated with latest results.

Jul 10 '22 08:07 github-actions[bot]

closing in favor of #4219

Aug 20 '22 12:08 kszucs

ibis
ibis copied to clipboard

refactor: remove intermediate expressions

Comment last updated at 2022-06-22 17:50:47 UTC

Unit Test Results

Codecov Report

Replacing the functionality in `analysis.py` with a rule based rewrite system

Test Results

ibis ibis copied to clipboard

refactor: remove intermediate expressions

Comment last updated at 2022-06-22 17:50:47 UTC

Unit Test Results

Codecov Report

Replacing the functionality in analysis.py with a rule based rewrite system

Test Results

ibis
ibis copied to clipboard

Replacing the functionality in `analysis.py` with a rule based rewrite system