SynapseML
Making synapse.ml compatible with pandas_on_pyspark
With pyspark 3.2.0 now supporting pandas-on-Spark (formerly Koalas), compatibility with it would make things much easier for a lot of us travelling from the faraway world of pandas. Currently, we manipulate the data using the pandas-on-Spark API, and whenever an interaction with synapse.ml (like a featurizer) is needed, a `.to_spark()` call is required beforehand and a `.to_pandas_on_spark()` call afterwards. I currently have a context manager to do this, and I wrap all my code that needs synapse.ml within this context manager.
Thanks for raising this @Nitinsiwach. Do SparkML models have a better API/usability pattern than our models? If there is a systematic difference between our API and the SparkML pyspark API, we would want to close the gap.
Hello @mhamilton723, sorry for the delayed response. I was down with health issues.
I have not used SparkML, so I won't be able to have an opinion on it. I think I failed to communicate the issue properly, so I will try again:
PySpark 3.2.0 has a pandas compatibility module, `pyspark.pandas`, which can be imported with `from pyspark import pandas`. It exists to do away with the cost of switching from pandas to PySpark and learning a new API. I find SynapseML amazing and it has been a wonderful experience using it. But the SynapseML API consumes only Spark dataframes. Although I am now getting better at writing code in PySpark, initially I tried to use pandas-on-Spark heavily; however, every time I had to use anything from SynapseML, I had to do a customary `dataframe.to_spark()` before calling a SynapseML method and `dataframe.to_pandas_on_spark()` after. I just thought it might be better if the SynapseML API were compatible with pandas-on-Spark. But now I am not sure if the benefits outweigh the costs of implementing such a thing.