[CALCITE-7122] Eliminate nested calls for idempotent unary functions UPPER/LOWER/ABS/INITCAP/CEIL/FLOOR
jira: https://issues.apache.org/jira/browse/CALCITE-7122
Can we take this discussion to Jira? We need to have a conversation about specification. Those conversations belong in Jira.
Also please STOP logging a jira case just the moment before you push a PR. We know that you have been working on the PR for a few days. Give the rest of us chance to weigh in on the specification before you write the implementation.
Please fix the reference to the jira case. Is it 7122 or 7112?
Can we take this discussion to Jira? We need to have a conversation about specification. Those conversations belong in Jira.
Also please STOP logging a jira case just the moment before you push a PR. We know that you have been working on the PR for a few days. Give the rest of us chance to weigh in on the specification before you write the implementation.
Please fix the reference to the jira case. Is it 7122 or 7112?
OK, thanks for the reminder. I will list the JIRA in advance so that everyone can evaluate it together. @julianhyde
Can idempotent functions such as ceil and floor also be considered together?
Can idempotent functions such as ceil and floor also be considered together?
The ceil and floor functions have already been separately processed before, and the logic is a little complex, so they should be no need to handle them here.
Can idempotent functions such as ceil and floor also be considered together?
The ceil and floor functions have already been separately processed before, and the logic is a little complex, so they should be no need to handle them here.
I see, these two functions are binary functions. I don't have any other questions for now.
I see, these two functions are binary functions.
Numeric CEIL and FLOOR are unary. E.g. CEIL(CEIL(3.14)) = CEIL(3.14) = 4.
Datetime CEIL and FLOOR are binary. E.g. FLOOR(FLOOR(DATE '2025-02-08' TO MONTH)) = FLOOR(DATE '2025-02-08' TO MONTH) = DATE '2025-02-01'. The second parameter is a time unit, basically a literal over an enum type, so we could regard the functions as a family of unary functions, e.g. FLOOR_MONTH, CEIL_MONTH, FLOOR_YEAR, etc.
I see, these two functions are binary functions.
Numeric
CEILandFLOORare unary. E.g.CEIL(CEIL(3.14))=CEIL(3.14)=4.Datetime
CEILandFLOORare binary. E.g.FLOOR(FLOOR(DATE '2025-02-08' TO MONTH))=FLOOR(DATE '2025-02-08' TO MONTH)=DATE '2025-02-01'. The second parameter is a time unit, basically a literal over an enum type, so we could regard the functions as a family of unary functions, e.g.FLOOR_MONTH,CEIL_MONTH,FLOOR_YEAR, etc.
I agree with @julianhyde 's idea. I also have a suggestion—though I’m not sure how feasible it is—to consolidate all the currently identified unary idempotent functions into a unified encapsulation while ensuring compatibility with the existing ones in simplify . The goal would be to extract a shared logic layer to avoid scattering similar implementations across multiple places, which could deduce future maintenance costs. Currently, only four functions are handled uniformly, and without careful review, others might mistakenly treat CEIL and FLOOR as non-idempotent or assume they were overlooked. If modifying this proves too difficult, we could at least document all the relevant functions in the comments to prevent future developers from thinking the feature was implemented incompletely.
I’m not entirely sure if this suggestion is valid. If Julian has time, I’d appreciate it if you could take a look as well.
Floor and ceil are not purely unary functions also support binary, and they were previously handled separately, so I did not include them in the logic of the unary function list. They were not accidentally omitted, and I think it is not a problem to handle them uniformly in unary way.
But regarding the scalability mentioned,personally I believe that putting all idempotent functions on a unified list and identifying other types is consistent with the current approach. If there is no clear transformation plan, there is no need to abstract it, which may increase the understanding cost for subsequent developers or let someone with a deep understanding of the plan to come up with a clear plan to follow up separately.
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
93.9% Coverage on New Code
0.0% Duplication on New Code
what is the status of this PR? It is marked as "changes requested". Have all change request been addressed?
what is the status of this PR? It is marked as "changes requested". Have all change request been addressed?
IMO it had been addressed all comments now. @mihaibudiu
@xiedeyantu can you please confirm that you are satisfied with these changes?
@xiedeyantu can you please confirm that you are satisfied with these changes?
@mihaibudiu I'm just concerned that the current implementation might be too loose and not conducive to long-term maintenance. What I was trying to express is whether each function itself could have a common property to indicate whether it is an idempotent function. If there were a way to achieve this point, it would also benefit functional dependency analysis later, such as determining whether a function is an injective function. Or perhaps I'm being too critical - logically I don't think there should be any errors in the current approach. If you think the current implementation is fine, I have no objection.
So your suggestion is to add a property to SqlFunction which says whether the function is idempotent? Frankly, I think there would be too many such properties to create for various interesting semantic properties of functions that could be used by the optimizer. Moreover, I think that the kinds of programs optimized by this PR never occur in practice, because this is just not how people write programs. So I think this is a lot of work for very little benefit.
So your suggestion is to add a property to SqlFunction which says whether the function is idempotent? Frankly, I think there would be too many such properties to create for various interesting semantic properties of functions that could be used by the optimizer. Moreover, I think that the kinds of programs optimized by this PR never occur in practice, because this is just not how people write programs. So I think this is a lot of work for very little benefit.
If I understand correctly, you also hope that all functions can have some necessary common properties, although not necessarily the idempotent property. If this PR can achieve minimal intrusive changes, I’m also fine with it. However, I also think that the optimization in the current PR is not very cost-effective.
@mihaibudiu I'm considering exploring whether it's feasible to add some useful attributes to functions. At a certain stage, achieving function dependency capabilities requires handling specific functions, but I'm not sure if adding attributes is a worthwhile effort for Calcite.
There are some properties like determinism, behavior with nulls, etc. which are very useful for optimizations. Idempotence - perhaps less. I think this should be judged on a case by case basis.
So let's review this once more and merge it if it works; we are spending a lot of time on an optimization which will probably be never applied in practice.
What's the status of this PR? Does it fulfill the scope that was agreed upon?
There is some concern that this is relatively intrusive, making code more complex and hard to maintain, and it does not really solve a practical problem. That's why it wasn't merged yet.
I would be more confident if someone could supply a real example where this would be beneficial. The only case I can imagine is for machine-generated SQL.
I would be more confident if someone could supply a real example where this would be beneficial. The only case I can imagine is for machine-generated SQL.
This should be an optimization for functions, similar to generating more efficient SQL. This idempotency elimination is a relatively obvious SQL optimization, but it is hard to say how much efficiency it can improve. For example, some enterprise SQL platforms will first optimize SQL, requiring the most concise and efficient SQL output, and then send it to the compute engine for execution. I think this optimization is profitable, although not much. @mihaibudiu
Can you show a (non-artificial) program written by some person or machine which would benefit from this optimization?
I hadn't noticed this information before, but I see that it mentions other architectural design suggestions, so we might need to discuss it further.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 90 days if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.