[CALCITE-7031] Implement the general decorrelation algorithm (Neumann & Kemper)
Thank you for your review. Regarding the D relations, in the code it is the dedupFreeVarsNode, which is generated by calling the generateDedupFreeVarsNode after decorrelating the left input of the Correlate, and builder.distinct is used there to ensure it is duplicate free.
This is still a draft. Should this work continue? If so, I will continue to refine the code and comments.
builder.distinct is used there to ensure it is duplicate free.
builder.distinct ensures that it is distinct, but in general the functions that do the rewrite do not know that there is a distinct somewhere deep in the tree. My question is whether there should be some information that promises that the rel node passed to one of these functions is guaranteed to be distinct. Maybe the fact that the functions are private is enough, since all callers guarantee this.
This is still a draft. Should this work continue?
I think this is great. If it holds the promise to solve many of the current decorrelator limits, we should adopt it. Do you see any downsides?
As I commented elsewhere, one problem I foresee is that the UNNEST array operation cannot be represented without a Correlate node in general, so a new Rel node may need to be introduced to be able to handle such plans. But this can also be done later.
Perhaps wrapping the D (dedupFreeVarsNode) to make it immutable after creation would be better. Did I understand you correctly?
Do you see any downsides?
The current draft implements the general decorrelation approach. For some very simple cases, the paper also proposes a simpler decorrelation that would yield cleaner plans. It seems that matching some fixed patterns using rules would suffice. I think the simple decorrelation approach can be considered later.
@silundong this is great work, I hope you can advance it. I think that covering all cases is not necessary in the first implementation, but having an extensible framework where new operators can be handled will enable more cases to be handled later.
Thank you! I'll continue to advance and address the comments. I should be able to complete the updates in the next few days. This PR doesn't include the enumerable implementation for mark join yet, I feel this PR is already quite large in scope. Should I complete it in this pr, or would it be better to create a new JIRA issue for it? I'm happy to work on it.
No, I think that splitting this into multiple PRs will make it manageable for the reviewers; even reviewing this first piece will be non-trivial. Ideally the infra changes, which add all the new classes are separate, and then additional PRs can be added to handle various operators.
After some consideration, I believe it’s necessary to split this issue into multiple subtasks. To address the problem of decorrelating boolean context IN or existential subqueries directly into SEMI/ANTI joins for CALCITE-3373, we can adopt an approach similar to that of Umbra (https://umbra-db.com/interface/ the implementation system described in related papers). The steps would be: first convert some IN or existential subqueries into Mark Joins—given Mark Join’s strong expressiveness—then optimize them into SEMI/ANTI joins via rules, and finally hand them over to RelDecorrelator for processing. I think introducing Mark Join will be highly useful for both the existing RelDecorrelator and the proposed TopDownGeneralDecorrelator we intend to implement. Furthermore, this approach of introducing Mark Join is much more elegant than the handling method in https://github.com/apache/calcite/pull/4211. Additionally, Mark Join will only serve as an internal representation within Calcite: it will either be simplified into SEMI/ANTI joins or, if simplification is not feasible, converted into Calcite’s existing literal_agg expression through transformation rules. This process will be completely transparent to Calcite users. In my opinion, it is more cost-effective to first introduce Mark Join, and then consider whether to implement the new TopDownGeneralDecorrelator framework at a later stage.
WITH
dept(dname, deptno, loc) AS (
VALUES
('ACCOUNTING', 10, 'NEW YORK'),
('RESEARCH', 20, 'DALLAS'),
('SALES', 30, 'CHICAGO'),
('OPERATIONS', 40, 'BOSTON')
),
emp(empno, ename, job, mgr, hiredate, sal, comm, deptno) AS (
VALUES
(7369, 'SMITH', 'CLERK', 7902, '1980-12-17'::date, 800.00, NULL::numeric, 20),
(7499, 'ALLEN', 'SALESMAN', 7698, '1981-02-20'::date, 1600.00, 300.00, 30),
(7521, 'WARD', 'SALESMAN', 7698, '1981-02-22'::date, 1250.00, 500.00, 30),
(7566, 'JONES', 'MANAGER', 7839, '1981-02-04'::date, 2975.00, NULL, 20),
(7654, 'MARTIN', 'SALESMAN', 7698, '1981-09-28'::date, 1250.00, 1400.00, 30),
(7698, 'BLAKE', 'MANAGER', 7839, '1981-01-05'::date, 2850.00, NULL, 30),
(7782, 'CLARK', 'MANAGER', 7839, '1981-06-09'::date, 2450.00, NULL, 10),
(7788, 'SCOTT', 'ANALYST', 7566, '1987-04-19'::date, 3000.00, NULL, 20),
(7839, 'KING', 'PRESIDENT', NULL, '1981-11-17'::date, 5000.00, NULL, 10),
(7844, 'TURNER', 'SALESMAN', 7698, '1981-09-08'::date, 1500.00, 0.00, 30),
(7876, 'ADAMS', 'CLERK', 7788, '1987-05-23'::date, 1100.00, NULL, 20),
(7900, 'JAMES', 'CLERK', 7698, '1981-12-03'::date, 950.00, NULL, 30),
(7902, 'FORD', 'ANALYST', 7566, '1981-12-03'::date, 3000.00, NULL, 20),
(7934, 'MILLER', 'CLERK', 7782, '1982-01-23'::date, 1300.00, NULL, 10)
),
bonus(ENAME, JOB, SAL, COMM) AS (
VALUES
('ALLEN', 'SALESMAN', 1600.00, 300.00),
('WARD', 'SALESMAN', 1250.00, 500.00)
)
select count(*) as c
from emp as e
where sal + 100 not in (
select deptno
from dept
where dname = e.ename);
Input the above SQL statement into umbra-db interface, click Execute, and check the optimization steps one by one. I believe it will be an excellent reference for us.
I will do my best to review this, but I expect it won't be easy or very fast. Thank you.
This commit completes the comments, introduces CorrelatePlus to support the condition attribute, and implements optimizations for D elimination. The test cases look good. I think the PR is now ready for review. @mihaibudiu PTAL, Thank you!
The commit includes:
- Optimize the
HepProgramconfiguration before and after general decorrelation. - Fix a correlation detection bug in general decorrelation based on @iwanttobepowerful finding in CALCITE-7303. Thanks for the good catch.
- Add test cases.
2. finding in CALCITE-7303.
I identified this bug because the implementation of RelDecorrelator.java fails to cover certain scenarios. Unexpectedly, it also exposed an issue in your implementation. The more I delve into it, the more I find that RelDecorrelator and TopDownGeneralDecorrelator share striking similarities in their core design philosophies.
Thank you for the detailed review! I will update in the next few days.
This commit mainly improves code style and comments. PTAL, thank you!
Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)
Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)
The PR is extremely helpful, if that's the only blocker (and IIRC the use of this new decorrelator is totally optional) I wouldn't block on that
Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)
As @asolimando says, this PR is already big enough, so you may as well add it in a separate PR, but then it means you can file the issue now about doing it.
I have already approved, so from my point of view feel free to merge after the conflicts are resolved and the commits are squashed. If anyone else plans to submit a review, please let us know to wait before merging.
This PR looks ready to me. Would it be possible to merge it to advance CALCITE-7315? Thank you! @mihaibudiu @asolimando @xiedeyantu
Can you write some documentation someplace (could be JavaDoc or JIRA) about how one should proceed to replace the old decorrelator with this one?
Do you have a plan to migrate other tests to also use this decorrelator somehow?
Can you write some documentation someplace (could be JavaDoc or JIRA) about how one should proceed to replace the old decorrelator with this one?
Sure, I will add some JavaDoc in TopDownDecorrelator.java.
From my observation, the most commonly used tests are RelOptRulesTest and the Quidem tests. RelOptRulesTest can already switch the decorrelator via configuration. Quidem tests involves the entire SQL execution. After completing the LEFT MARK join under the Enumerable convention, may be we can integrate the new decorrelator into the Quidem tests and make the switch configurable in the iq files.
If needed, I can help enable the new decorrelator to support quidem tests.
One way to migrate tests is to actually subclass the class which runs the tests and run the same tests using a different configuration/fixture.
For xml based tests (RelOptRulesTest) this may work well, since a new XML file will be associated with the new class.
A big problem with quidem will be that many tests include plans, many of which will change with the new decorrelator. I personally think that including plans in quidem tests should be avoided, exactly for this reason: plans are sensitive to the optimizations applied, whereas the quidem tests are designed to check end-to-end correctness.
A big problem with quidem will be that many tests include plans
How about disabling !plan in the Quidem test files, and moving the existing test cases with !plan to either RelOptRuleTest or RelDecorrelatorTest?
Maybe !plan could be somehow ignored only for the fixture that uses the new decorrelator. "disabling" will require removing all the plans from the files, and probably some of the plans are useful.
Anyway, I think we should merge this PR, but we should also make sure that people know how to use the new decorrelator so it gets tested and improved. That's what I am waiting for: some instructions on using the new decorrelator.
For example, in our compiler we use withExpand(true) in the SqlToRelConverter. Can the new decorrelator work with both isExpand(true) and isExpand(false)? There may be other similar questions.
I've added usage notes for TopDownGeneralDecorrelator. PTAL, thank you!
I think the linter is upset about the new javadoc format.
Quality Gate passed
Issues
11 New issues
0 Accepted issues
Measures
0 Security Hotspots
87.7% Coverage on New Code
0.0% Duplication on New Code