Results: 74 comments by BInwei Yang

@zhouyuan @zhixingheyi-tian

The same root cause as https://github.com/oap-project/gazelle_plugin/issues/906. We should add an ARROW_CHECK for every place where an int16 is used as the record batch size.
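The hazard here is that a row count like 32768 silently wraps to a negative value when narrowed to int16. A minimal Python model of the bounds check (the real fix would be a C++ ARROW_CHECK in the native code; `check_batch_size` is a hypothetical name used only for illustration):

```python
import ctypes

INT16_MAX = 32767


def check_batch_size(num_rows: int) -> int:
    """Model of the bounds check the native code should perform before a
    record batch size is narrowed to int16."""
    if not (0 <= num_rows <= INT16_MAX):
        raise OverflowError(f"batch of {num_rows} rows does not fit in int16")
    return num_rows


# Without the check, narrowing silently wraps:
wrapped = ctypes.c_int16(32768).value  # becomes -32768
```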

The root cause is the same as https://github.com/oap-project/gazelle_plugin/issues/928

mmap shows worse performance than read/write:

| | spill | write |
|---|---|---|
| mmap | 2.23 | 4.3 |
| read/write | 0.84 | 1.49 |

It looks like the difference comes from page-fault handling: write doesn't cause major page faults...
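The contrast above can be sketched in Python: an mmap-backed write touches pages through the fault handler, while a plain write(2) goes through the page cache without faulting the writer. The fault counters from `getrusage` (Unix-only) show what the comment is measuring; this is an illustration, not gazelle's spill code:

```python
import mmap
import os
import resource


def write_via_mmap(path: str, data: bytes) -> None:
    """Write by memory-mapping the file; each first touch of a page is
    serviced by a page fault, and evicted pages can fault again (major)."""
    with open(path, "wb") as f:
        f.truncate(len(data))
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), len(data)) as m:
            m[:] = data


def write_via_syscall(path: str, data: bytes) -> None:
    """Plain write(2): data goes to the page cache; the writing process
    does not take major page faults for it."""
    with open(path, "wb") as f:
        f.write(data)


def fault_counts() -> tuple:
    """Return (minor, major) page faults for this process so far."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_minflt, ru.ru_majflt
```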

Related bug: `dfw.repartition(144).count()` returns `0`. The log reports: `INFO shuffle.ColumnarShuffleWriter: Skip ColumnarBatch of 32768 rows, 0 cols`

`dfx.coalesce(1).count()` also returns 0; not sure if it's the same issue as repartition.

The query plan is the same for both: `dfw.repartition(144).where("ss_customer_sk is null").count()` and `dfw.where("ss_customer_sk is null").repartition(144).count()`

There is no performance difference if we set PreferSpill=true, because the memory is allocated only once.
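The allocate-once behavior can be sketched as a preallocated buffer that every spill reuses, so the allocation cost is paid a single time regardless of the PreferSpill setting. The class and counter below are hypothetical, for illustration only:

```python
class SpillBuffer:
    """Sketch of a spill path whose buffer is allocated once up front
    and reused for every subsequent spill (names are illustrative,
    not gazelle's actual implementation)."""

    def __init__(self, capacity: int):
        self._buf = bytearray(capacity)  # one-time allocation
        self.allocations = 1             # never grows after construction

    def spill(self, data: bytes) -> int:
        """Copy `data` into the reused buffer; no new allocation occurs."""
        if len(data) > len(self._buf):
            raise ValueError("spill larger than preallocated buffer")
        self._buf[: len(data)] = data
        return len(data)
```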

Do you mean to use Spark's memory management system? Then we would need to define a set of APIs for the native library, which could then be implemented on top of the native implementation....
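The API set in question might look like the interface below: acquire/release calls that go through Spark's task memory accounting, plus a spill callback for memory pressure. All names here are hypothetical, loosely modeled on Spark's MemoryConsumer pattern, and not gazelle's actual interface:

```python
from abc import ABC, abstractmethod


class NativeMemoryManager(ABC):
    """Hypothetical API surface a native library would implement if it
    delegated accounting to Spark's memory management (illustrative only)."""

    @abstractmethod
    def acquire(self, size: int) -> int:
        """Request `size` bytes; return the number of bytes actually granted."""

    @abstractmethod
    def release(self, size: int) -> None:
        """Return `size` bytes to the pool."""

    @abstractmethod
    def spill(self, size: int) -> int:
        """Callback under memory pressure; return the number of bytes freed."""


class SimpleManager(NativeMemoryManager):
    """Toy in-process implementation to show how the API composes."""

    def __init__(self, limit: int):
        self.limit, self.used = limit, 0

    def acquire(self, size: int) -> int:
        granted = min(size, self.limit - self.used)
        self.used += granted
        return granted

    def release(self, size: int) -> None:
        self.used = max(0, self.used - size)

    def spill(self, size: int) -> int:
        freed = min(size, self.used)
        self.used -= freed
        return freed
```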