Andy Grove
Disabling sortMergeJoin via configs restores the original performance.
I ran again with latest from main (0033), and then with SMJ + join filter disabled manually (0034). Here are the event logs. [app-20240904131653-0033.gz](https://github.com/user-attachments/files/16877346/app-20240904131653-0033.gz) [app-20240904132048-0034.gz](https://github.com/user-attachments/files/16877355/app-20240904132048-0034.gz)
Here is a screenshot comparing the plans with SMJ+filter enabled on the left and disabled on the right. 
The `Display` implementation for `ScalarValue` changed between DataFusion 37 (the version that Ballista is using) and the version that Comet is using. In the older version, Date32 is shown as an...
I tested a prototype of optimizing this filter and saw a 7% improvement in filter time for this query. It seems worth implementing.
> This might work ok for tpc-h but tpc-ds data has nulls and the null check is required perhaps? Does ballista know about the nullability of the data?

Yes, the...
Related to this, it would be nice if we could improve the metrics for CometHashAggregate to show the time for evaluating the aggregate input expressions. I am not sure how...
> Good finding. I think this kind of optimization should be in Spark optimizer instead.

It would make sense for Spark to add this, but I think that it could...
There is now a DataFusion PR to add this feature: https://github.com/apache/datafusion/pull/13046
The DataFusion PR https://github.com/apache/datafusion/pull/13046 is still waiting for a review. I am adding this issue back onto the 0.6 milestone as a reminder.