Does this require pandas to be installed on cluster nodes?

Open cyberfox1 opened this issue 2 years ago • 2 comments

The docs are extremely obscure, but they seem to claim both that pandas functions will run "on spark" and that pandas will be applied to partitions of data.

So is a function that uses pandas converted to a spark equivalent, or does pandas still execute on cluster nodes?

Getting pandas installed across a spark cluster is problematic for me.

cyberfox1 avatar Jul 27 '22 01:07 cyberfox1

Hi @cyberfox1,

Thanks for the issue! The README should have a link to the tutorials, which are the main form of documentation. I will point the Fugue API docs to the tutorials in the near future. To answer the question: pandas is applied to the partitions of the data, so it does still execute on the cluster nodes. You can think of it as PySpark's pandas_udf or mapInPandas. The function is not converted to a Spark equivalent; it is executed through one of the underlying Spark methods (which can also be mapPartitions or udf). Spark has a lot of interfaces, and we aim to condense them to make it friendlier to users.

I think pandas should already be a requirement of Spark, especially in 3.0 and above, where mapInPandas exists. What Spark version are you on? Could you give more details on why installing pandas is problematic for you?
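
To make the "applied on the partitions" point concrete, here is a minimal local sketch. It is not Fugue's implementation and does not use Spark; `add_total`, `run_per_partition`, and the sample columns are hypothetical names made up for illustration. It only mimics the execution model: the data is split into chunks and the unmodified pandas function is called on each chunk, which is why pandas has to be importable wherever the chunks are processed.

```python
import pandas as pd


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Plain pandas logic. Nothing in here is translated into Spark expressions.
    return df.assign(total=df["price"] * df["qty"])


def run_per_partition(df: pd.DataFrame, key: str, fn) -> pd.DataFrame:
    # Local stand-in (assumption) for partition-wise execution: split the
    # data by `key`, call the pandas function on each piece, then recombine.
    parts = [fn(part) for _, part in df.groupby(key, sort=True)]
    return pd.concat(parts, ignore_index=True)


data = pd.DataFrame(
    {"store": ["a", "a", "b"], "price": [1.0, 2.0, 3.0], "qty": [2, 1, 4]}
)
result = run_per_partition(data, "store", add_total)
print(result)
```

On a real cluster the splitting and recombining is done by Spark, and each call to the function happens inside a Python worker on an executor, so that worker's environment needs pandas installed.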

kvnkho avatar Jul 27 '22 02:07 kvnkho

I will have to investigate, but in short, we have workloads that could be deployed to arbitrary corporate-run clusters that we have little or no control over.
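
A first step in that investigation could be a small probe that reports whether pandas is importable in a given Python process. This is only a sketch using the standard library; `probe_module` and `probe_partition` are hypothetical helper names, and how the probe gets shipped to a particular cluster is an assumption that depends on the deployment.

```python
import importlib.util


def probe_module(name: str = "pandas") -> bool:
    # True if `name` can be located by the import system of this process.
    return importlib.util.find_spec(name) is not None


def probe_partition(rows):
    # Shaped like a function for RDD.mapPartitions: it receives an iterator
    # of rows (ignored here) and yields one result per partition.
    yield probe_module("pandas")


print("pandas importable here:", probe_module("pandas"))
```

Assuming an existing SparkSession named `spark`, something like `spark.sparkContext.parallelize(range(8), 8).mapPartitions(probe_partition).collect()` would run the check inside the executors' Python workers rather than on the driver.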

cyberfox1 avatar Jul 27 '22 02:07 cyberfox1

Closing, feel free to reopen. Thanks

goodwanghan avatar Aug 16 '22 01:08 goodwanghan