calcite icon indicating copy to clipboard operation
calcite copied to clipboard

[CALCITE-7031] Implement the general decorrelation algorithm (Neumann & Kemper)

Open silundong opened this issue 2 months ago • 30 comments

JIRA

silundong avatar Nov 06 '25 06:11 silundong

Thank you for your review. Regarding the D relations, in the code it is the dedupFreeVarsNode, which is generated by calling the generateDedupFreeVarsNode after decorrelating the left input of the Correlate, and builder.distinct is used there to ensure it is duplicate free. This is still a draft. Should this work continue? If so, I will continue to refine the code and comments.

silundong avatar Nov 07 '25 09:11 silundong

builder.distinct is used there to ensure it is duplicate free.

builder.distinct ensures that it is distinct, but in general the functions that do the rewrite do not know that there is a distinct somewhere deep in the tree. My question is whether there should be some information that promises that the rel node passed to one of these functions is guaranteed to be distinct. Maybe the fact that the functions are private is enough, since all callers guarantee this.

mihaibudiu avatar Nov 07 '25 17:11 mihaibudiu

This is still a draft. Should this work continue?

I think this is great. If it holds the promise to solve many of the current decorrelator limits, we should adopt it. Do you see any downsides?

As I commented elsewhere, one problem I foresee is that the UNNEST array operation cannot be represented without a Correlate node in general, so a new Rel node may need to be introduced to be able to handle such plans. But this can also be done later.

mihaibudiu avatar Nov 07 '25 17:11 mihaibudiu

Perhaps wrapping the D (dedupFreeVarsNode) to make it immutable after creation would be better. Did I understand you correctly?

Do you see any downsides?

The current draft implements the general decorrelation approach. For some very simple cases, the paper also proposes a simpler decorrelation that would yield cleaner plans. It seems that matching some fixed patterns using rules would suffice. I think the simple decorrelation approach can be considered later.

silundong avatar Nov 08 '25 08:11 silundong

@silundong this is great work, I hope you can advance it. I think that covering all cases is not necessary in the first implementation, but having an extensible framework where new operators can be handled will enable more cases to be handled later.

mihaibudiu avatar Nov 19 '25 16:11 mihaibudiu

Thank you! I'll continue to advance and address the comments. I should be able to complete the updates in the next few days. This PR doesn't include the enumerable implementation for mark join yet, I feel this PR is already quite large in scope. Should I complete it in this pr, or would it be better to create a new JIRA issue for it? I'm happy to work on it.

silundong avatar Nov 20 '25 01:11 silundong

No, I think that splitting this into multiple PRs will make it manageable for the reviewers; even reviewing this first piece will be non-trivial. Ideally the infra changes, which add all the new classes are separate, and then additional PRs can be added to handle various operators.

mihaibudiu avatar Nov 20 '25 01:11 mihaibudiu

After some consideration, I believe it’s necessary to split this issue into multiple subtasks. To address the problem of decorrelating boolean context IN or existential subqueries directly into SEMI/ANTI joins for CALCITE-3373, we can adopt an approach similar to that of Umbra (https://umbra-db.com/interface/ the implementation system described in related papers). The steps would be: first convert some IN or existential subqueries into Mark Joins—given Mark Join’s strong expressiveness—then optimize them into SEMI/ANTI joins via rules, and finally hand them over to RelDecorrelator for processing. I think introducing Mark Join will be highly useful for both the existing RelDecorrelator and the proposed TopDownGeneralDecorrelator we intend to implement. Furthermore, this approach of introducing Mark Join is much more elegant than the handling method in https://github.com/apache/calcite/pull/4211. Additionally, Mark Join will only serve as an internal representation within Calcite: it will either be simplified into SEMI/ANTI joins or, if simplification is not feasible, converted into Calcite’s existing literal_agg expression through transformation rules. This process will be completely transparent to Calcite users. In my opinion, it is more cost-effective to first introduce Mark Join, and then consider whether to implement the new TopDownGeneralDecorrelator framework at a later stage.

iwanttobepowerful avatar Nov 20 '25 13:11 iwanttobepowerful

WITH 
  dept(dname, deptno, loc) AS (
 VALUES 
   ('ACCOUNTING', 10, 'NEW YORK'),
   ('RESEARCH', 20, 'DALLAS'),
   ('SALES', 30, 'CHICAGO'),
   ('OPERATIONS', 40, 'BOSTON')
  ),
  emp(empno, ename, job, mgr, hiredate, sal, comm, deptno) AS (
 VALUES 
   (7369, 'SMITH', 'CLERK', 7902, '1980-12-17'::date, 800.00, NULL::numeric, 20),
   (7499, 'ALLEN', 'SALESMAN', 7698, '1981-02-20'::date, 1600.00, 300.00, 30),
   (7521, 'WARD', 'SALESMAN', 7698, '1981-02-22'::date, 1250.00, 500.00, 30),
   (7566, 'JONES', 'MANAGER', 7839, '1981-02-04'::date, 2975.00, NULL, 20),
   (7654, 'MARTIN', 'SALESMAN', 7698, '1981-09-28'::date, 1250.00, 1400.00, 30),
   (7698, 'BLAKE', 'MANAGER', 7839, '1981-01-05'::date, 2850.00, NULL, 30),
   (7782, 'CLARK', 'MANAGER', 7839, '1981-06-09'::date, 2450.00, NULL, 10),
   (7788, 'SCOTT', 'ANALYST', 7566, '1987-04-19'::date, 3000.00, NULL, 20),
   (7839, 'KING', 'PRESIDENT', NULL, '1981-11-17'::date, 5000.00, NULL, 10),
   (7844, 'TURNER', 'SALESMAN', 7698, '1981-09-08'::date, 1500.00, 0.00, 30),
   (7876, 'ADAMS', 'CLERK', 7788, '1987-05-23'::date, 1100.00, NULL, 20),
   (7900, 'JAMES', 'CLERK', 7698, '1981-12-03'::date, 950.00, NULL, 30),
   (7902, 'FORD', 'ANALYST', 7566, '1981-12-03'::date, 3000.00, NULL, 20),
   (7934, 'MILLER', 'CLERK', 7782, '1982-01-23'::date, 1300.00, NULL, 10)
  ),
  bonus(ENAME, JOB, SAL, COMM) AS (
 VALUES 
 ('ALLEN', 'SALESMAN', 1600.00, 300.00),
 ('WARD', 'SALESMAN', 1250.00, 500.00)
  )
select count(*) as c
from emp as e
where sal + 100 not in (
  select deptno
  from dept
  where dname = e.ename);

Input the above SQL statement into umbra-db interface, click Execute, and check the optimization steps one by one. I believe it will be an excellent reference for us. Clipboard_Screenshot_1763644354 Clipboard_Screenshot_1763645122

iwanttobepowerful avatar Nov 20 '25 13:11 iwanttobepowerful

I will do my best to review this, but I expect it won't be easy or very fast. Thank you.

mihaibudiu avatar Nov 21 '25 06:11 mihaibudiu

This commit completes the comments, introduces CorrelatePlus to support the condition attribute, and implements optimizations for D elimination. The test cases look good. I think the PR is now ready for review. @mihaibudiu PTAL, Thank you!

silundong avatar Nov 21 '25 06:11 silundong

The commit includes:

  1. Optimize the HepProgram configuration before and after general decorrelation.
  2. Fix a correlation detection bug in general decorrelation based on @iwanttobepowerful finding in CALCITE-7303. Thanks for the good catch.
  3. Add test cases.

silundong avatar Nov 24 '25 06:11 silundong

2. finding in CALCITE-7303.

I identified this bug because the implementation of RelDecorrelator.java fails to cover certain scenarios. Unexpectedly, it also exposed an issue in your implementation. The more I delve into it, the more I find that RelDecorrelator and TopDownGeneralDecorrelator share striking similarities in their core design philosophies.

iwanttobepowerful avatar Nov 24 '25 09:11 iwanttobepowerful

Thank you for the detailed review! I will update in the next few days.

silundong avatar Nov 26 '25 06:11 silundong

This commit mainly improves code style and comments. PTAL, thank you!

silundong avatar Dec 01 '25 10:12 silundong

Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)

silundong avatar Dec 04 '25 10:12 silundong

Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)

The PR is extremely helpful, if that's the only blocker (and IIRC the use of this new decorrelator is totally optional) I wouldn't block on that

asolimando avatar Dec 04 '25 16:12 asolimando

Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)

As @asolimando says, this PR is already big enough, so you may as well add it in a separate PR, but then it means you can file the issue now about doing it.

mihaibudiu avatar Dec 04 '25 22:12 mihaibudiu

I have already approved, so from my point of view feel free to merge after the conflicts are resolved and the commits are squashed. If anyone else plans to submit a review, please let us know to wait before merging.

mihaibudiu avatar Dec 04 '25 22:12 mihaibudiu

This PR looks ready to me. Would it be possible to merge it to advance CALCITE-7315? Thank you! @mihaibudiu @asolimando @xiedeyantu

silundong avatar Dec 11 '25 07:12 silundong

Can you write some documentation someplace (could be JavaDoc or JIRA) about how one should proceed to replace the old decorrelator with this one?

Do you have a plan to migrate other tests to also use this decorrelator somehow?

mihaibudiu avatar Dec 11 '25 07:12 mihaibudiu

Can you write some documentation someplace (could be JavaDoc or JIRA) about how one should proceed to replace the old decorrelator with this one?

Sure, I will add some JavaDoc in TopDownDecorrelator.java.

From my observation, the most commonly used tests are RelOptRulesTest and the Quidem tests. RelOptRulesTest can already switch the decorrelator via configuration. Quidem tests involves the entire SQL execution. After completing the LEFT MARK join under the Enumerable convention, may be we can integrate the new decorrelator into the Quidem tests and make the switch configurable in the iq files.

silundong avatar Dec 11 '25 09:12 silundong

If needed, I can help enable the new decorrelator to support quidem tests.

xiedeyantu avatar Dec 11 '25 11:12 xiedeyantu

One way to migrate tests is to actually subclass the class which runs the tests and run the same tests using a different configuration/fixture.

For xml based tests (RelOptRulesTest) this may work well, since a new XML file will be associated with the new class.

A big problem with quidem will be that many tests include plans, many of which will change with the new decorrelator. I personally think that including plans in quidem tests should be avoided, exactly for this reason: plans are sensitive to the optimizations applied, whereas the quidem tests are designed to check end-to-end correctness.

mihaibudiu avatar Dec 11 '25 17:12 mihaibudiu

A big problem with quidem will be that many tests include plans

How about disabling !plan in the Quidem test files, and moving the existing test cases with !plan to either RelOptRuleTest or RelDecorrelatorTest?

iwanttobepowerful avatar Dec 12 '25 06:12 iwanttobepowerful

Maybe !plan could be somehow ignored only for the fixture that uses the new decorrelator. "disabling" will require removing all the plans from the files, and probably some of the plans are useful.

mihaibudiu avatar Dec 12 '25 07:12 mihaibudiu

Anyway, I think we should merge this PR, but we should also make sure that people know how to use the new decorrelator so it gets tested and improved. That's what I am waiting for: some instructions on using the new decorrelator.

For example, in our compiler we use withExpand(true) in the SqlToRelConverter. Can the new decorrelator work with both isExpand(true) and isExpand(false)? There may be other similar questions.

mihaibudiu avatar Dec 12 '25 07:12 mihaibudiu

I've added usage notes for TopDownGeneralDecorrelator. PTAL, thank you!

silundong avatar Dec 12 '25 08:12 silundong

I think the linter is upset about the new javadoc format.

mihaibudiu avatar Dec 13 '25 18:12 mihaibudiu