
Convert Spark DataFrame to pandas DataFrame with Arrow

Open · dding3 opened this issue 1 year ago · 1 comment

Description

In the previous implementation, we converted an RDD of Spark rows directly to a pandas DataFrame. In this PR, we convert the Spark rows to an Arrow table first, then convert the Arrow table to a pandas DataFrame. Below is initial test performance data. Config: 1.1 GB CSV file, 55 GB memory, 10 cores.

without arrow: 29s

with arrow: 40s

Test code: [root@clx001]/home/ding/with_arrow.py, without_arrow.py

dding3 · Jul 15 '22 04:07

You may refer to the Pandas UDF implementation in Spark, which uses Arrow to convert between Spark DataFrames and pandas DataFrames.

jason-dai · Jul 15 '22 08:07