Andrew Lamb comments

Results: 1674 comments of Andrew Lamb

Generate GroupByHash output in multiple RecordBatches

Thank you @JasonLi-cn. I wonder whether we have tested the performance of this branch? I worry that the incremental output generation will result in copying the values multiple times...

Generate GroupByHash output in multiple RecordBatches

> This test may be incomplete, do you @alamb have any better test suggestions? 🤔

Hi @JasonLi-cn -- yes I think we should run the ClickBench and TPCH benchmarks using...

Generate GroupByHash output in multiple RecordBatches

I hit a bug (https://github.com/apache/datafusion/pull/11833) that has been fixed on main when trying to run the benchmarks on this branch:

```
Query 19 avg time: 161.22 ms
Error: External(ArrowError(InvalidArgumentError("column types...
```

Generate GroupByHash output in multiple RecordBatches

🤔

```
Query 17 iteration 4 took 6144.7 ms and returned 10 rows
Q18: SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase"...
```

Generate GroupByHash output in multiple RecordBatches

> Thank you @alamb 🙏. Let me analyze it further 🤔

In order to actually generate the output in multiple batches and gain performance, I think we would need to...

Generate GroupByHash output in multiple RecordBatches

> I agree, finally it should be a big change which switches the group values and related states managed by block like duckdb, and I am working on this (https://github.com/apache/datafusion/issues/11931)....

Generate GroupByHash output in multiple RecordBatches

> I think maybe we make it equal to batch_size in most cases, and so that we can avoid any split operations during producing output? And for the cornercase, for...
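The idea quoted above, keeping group values in blocks whose size equals `batch_size` so that output needs no split step, can be illustrated with a minimal sketch. This is not DataFusion's implementation: `BlockedGroupValues`, its methods, and the use of plain `Vec<u64>` in place of Arrow arrays are all hypothetical names and simplifications chosen for illustration.

```rust
/// Hypothetical block-based store for group values.
/// Illustration only; this is not a DataFusion type.
struct BlockedGroupValues {
    /// Assumed to equal the configured batch size.
    block_size: usize,
    /// Each inner Vec holds at most `block_size` values.
    blocks: Vec<Vec<u64>>,
}

impl BlockedGroupValues {
    fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        Self { block_size, blocks: Vec::new() }
    }

    /// Append one group value, opening a new block when the last one is full.
    fn push(&mut self, value: u64) {
        let needs_new = self
            .blocks
            .last()
            .map_or(true, |b| b.len() == self.block_size);
        if needs_new {
            self.blocks.push(Vec::with_capacity(self.block_size));
        }
        self.blocks
            .last_mut()
            .expect("a block was just ensured")
            .push(value);
    }

    /// Emit the stored values one block at a time. Each block is moved out,
    /// so no values are copied or sliced when producing output.
    fn into_batches(self) -> impl Iterator<Item = Vec<u64>> {
        self.blocks.into_iter()
    }
}
```

With a block size of 3 and seven pushed values, `into_batches` yields blocks of 3, 3, and 1 values; only the final block can be shorter than `block_size`.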

Generate GroupByHash output in multiple RecordBatches

I believe the plan here is that we will work to improve the coverage of aggregates and then revisit / revive this design

short-circuited expression should be evaluated one by one

I think this idea was largely implemented by @acking-you in https://github.com/apache/datafusion/pull/15694

short-circuited expression should be evaluated one by one

We merged a fix for this in the 51 release, but we have found a subtle problem, so I am planning to revert it for 51 (the fix should appear...