Nick Karlov comments

Results: 19 comments by Nick Karlov

Bad performance on wide tables (1000+ columns)

@alamb, thank you for the reply! I will continue posting about bottlenecks in DF (for instance, I've noticed DF performance degradation due to aggressive concurrency in the tokio scheduler and worked around...

Bad performance on wide tables (1000+ columns)

> various ways to make DataFusion's planning faster

It would also be good to consider implementing prepared physical plans (with parametrization), which would make it possible to cache them.

Bad performance on wide tables (1000+ columns)

@alamb, please take a look at PR https://github.com/apache/arrow-datafusion/pull/7870, where @oleggator has implemented a BTree instead of a list. It made physical plan construction about 2x faster.
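To illustrate why swapping a list for a BTree can help on wide tables, here is a minimal sketch of a name-to-position lookup done both ways. This is not the code from the PR: `WideSchema`, `index_of_linear`, and `index_of_btree` are hypothetical names made up for the example, and it uses only the Rust standard library.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a wide schema: an ordered list of column names
/// plus a name -> position index. Not a DataFusion type.
pub struct WideSchema {
    names: Vec<String>,
    // Built once in `new`, so later lookups do not have to scan `names`.
    index: BTreeMap<String, usize>,
}

impl WideSchema {
    pub fn new(names: Vec<String>) -> Self {
        let mut index = BTreeMap::new();
        for (i, name) in names.iter().enumerate() {
            // Keep the first position if a name repeats, matching the linear scan.
            index.entry(name.clone()).or_insert(i);
        }
        Self { names, index }
    }

    /// O(n) lookup: scans the list of names from the start.
    pub fn index_of_linear(&self, name: &str) -> Option<usize> {
        self.names.iter().position(|n| n == name)
    }

    /// O(log n) lookup: uses the BTreeMap built in `new`.
    pub fn index_of_btree(&self, name: &str) -> Option<usize> {
        self.index.get(name).copied()
    }
}
```

With 1000+ columns, a planner that resolves every column by scanning the list does work proportional to the square of the column count, while the indexed lookup keeps each resolution logarithmic. Whether that accounts for the speedup reported in the PR is an assumption here.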

Bad performance on wide tables (1000+ columns)

> it is still far from optimal

I think it's a good idea to cache instances of DFSchema (and Arrow Schema as well). Though the most flexible way is to implement...

Bad performance on wide tables (1000+ columns)

Another thought is to cache the physical plan (I tested using the optimized physical plan, serialized to protobuf, as a cache, and it improved performance dramatically).
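A rough sketch of the caching idea, assuming the plan can be keyed by the SQL text. `PlanCache` and `get_or_plan` are hypothetical names, and the generic parameter `P` stands in for whatever plan representation is stored. This version keeps plans in memory rather than serializing them to protobuf as in the test described above, and it uses only the Rust standard library.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical cache keyed by SQL text; `P` stands in for a planned query.
pub struct PlanCache<P> {
    plans: Mutex<HashMap<String, Arc<P>>>,
}

impl<P> PlanCache<P> {
    pub fn new() -> Self {
        Self {
            plans: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the cached plan for `sql`, calling `plan_fn` only on a miss.
    /// The lock is held while planning, so concurrent callers wait instead of
    /// planning the same query twice.
    pub fn get_or_plan<F>(&self, sql: &str, plan_fn: F) -> Arc<P>
    where
        F: FnOnce(&str) -> P,
    {
        let mut plans = self.plans.lock().unwrap();
        if let Some(plan) = plans.get(sql) {
            return Arc::clone(plan);
        }
        let plan = Arc::new(plan_fn(sql));
        plans.insert(sql.to_string(), Arc::clone(&plan));
        plan
    }
}
```

Keying by raw SQL text only helps when the same query string repeats. Parameterized (prepared) plans would let queries that differ only in literal values share one entry, but that is outside this sketch.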

Bad performance on wide tables (1000+ columns)

@alamb Hi! Could you please let us know if any work is planned here? We noticed that the performance of DataFusion on wide tables has slowed down significantly from version...

Bad performance on wide tables (1000+ columns)

@alamb, we ran the same perf test on 37.1, and it seems that 99% of request time is now spent on planning and optimizing (creating and optimizing the logical plan,...

Bad performance on wide tables (1000+ columns)

Thank you for your reply, @alamb! We'll check it on 38 and share the results. This particular example is synthetic, as we implemented it using pure in-memory tables without any external...

[EPIC] (Even More) Grouping / Group By / Aggregation Performance

Hi! Great job here! I ran into an issue with CoalesceBatches: it seems that there is a performance killer somewhere in CoalesceBatchesStream. It's spending too much time...

[EPIC] (Even More) Grouping / Group By / Aggregation Performance

Another related issue is the performance of **RowConverter** used for grouping. More than 75% of GroupedHashAggregateStream work is converting the composite aggregation key to a row. Approx. 50% of GroupedHashAggregateStream work is...