
Support for Pyspark dataframe

Open rishabh-dream11 opened this issue 10 months ago • 4 comments

🚀 The feature

PySpark is widely used in the community for ETL work on large datasets. Adding support for it would increase adoption of the product.

Motivation, pitch

My org uses PySpark as its only framework for ETL. EDA is done by visualising various cuts of the same PySpark DataFrame.

Alternatives

No response

Additional context

No response

rishabh-dream11 avatar Apr 11 '24 10:04 rishabh-dream11

This would be an interesting addition. I'm not sure how easy it would be to add PySpark support in the current setup, but it's definitely worth exploring. If I understand correctly, you would like to use PySpark as the engine? Or do you just want to be able to provide a Spark DataFrame as input?

gventuri avatar Apr 13 '24 21:04 gventuri

A PySpark engine, and it has to support a Spark DataFrame as input.

rishabh-dream11 avatar Apr 17 '24 18:04 rishabh-dream11

@gventuri Is there any progress/discussion on this issue? Will this be considered for future releases?

rishabh-dream11 avatar May 15 '24 06:05 rishabh-dream11

@gventuri I am also wondering whether it can execute PySpark code. Querying a large table took too long. Or is there a workaround to replace the code with PySpark code inside the pipeline?

ssling0817 avatar May 31 '24 01:05 ssling0817
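
For readers looking for a stopgap while this issue is open: nothing in the thread confirms a supported PySpark path, so the following is only a minimal sketch of one possible workaround. It assumes that a bounded sample of the Spark DataFrame is enough for the analysis and that the sample fits in driver memory. The helper name `spark_to_pandas_sample` and the `max_rows` default are hypothetical; `limit()` and `toPandas()` are standard `pyspark.sql.DataFrame` methods.

```python
def spark_to_pandas_sample(spark_df, max_rows=10_000):
    """Return at most `max_rows` rows of a Spark DataFrame as a pandas DataFrame.

    Hypothetical helper, not part of pandas-ai or PySpark. `spark_df` is
    assumed to be a pyspark.sql.DataFrame; only its `limit()` and
    `toPandas()` methods are used.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is lazy and only narrows the query plan.
    # toPandas() triggers execution and collects the rows onto the driver.
    return spark_df.limit(max_rows).toPandas()
```

The resulting pandas DataFrame can then be passed to any tool that accepts pandas input. This does not make queries run on Spark, so it does not address the request for a PySpark engine; it only caps how much data is pulled into pandas.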