SynapseML
Making synapse.ml compatible with pandas_on_pyspark
With pyspark 3.2.0 now supporting pandas-on-Spark (formerly Koalas), compatibility with it would make things much easier for a lot of us travelling from the faraway world of pandas. Currently, we manipulate the data using the pandas-on-Spark API, and whenever an interaction with synapse.ml (like a featurizer) is needed, a `.to_spark()` call is required beforehand and a `.to_pandas_on_spark()` call afterwards. I currently have a context manager to do this, and I wrap all my code that needs synapse.ml within this context manager.
Thanks for raising this @Nitinsiwach. Do SparkML models have a better API/usability pattern than our models? If there is a systematic difference between our API and the SparkML pyspark API, we would want to close the gap.
Hello @mhamilton723, sorry for the delayed response. I was down with health issues.
I have not used SparkML, so I won't be able to have an opinion on it. I think I failed to communicate the issue properly, so I will try again:
PySpark 3.2.0 has a pandas compatibility module, `pyspark.pandas`, which can be imported with `from pyspark import pandas`. It exists to do away with the cost of switching from pandas to PySpark and learning a new API. I find SynapseML amazing and it has been a wonderful experience using it. But the SynapseML API consumes only Spark dataframes. Although I am now getting better at writing code in PySpark, initially I tried to use pandas-on-Spark heavily; however, every time I had to use anything from SynapseML, I had to do a customary `dataframe.to_spark()` before calling a SynapseML method and `dataframe.to_pandas_on_spark()` after. I just thought it might be better if the SynapseML API were compatible with pandas-on-Spark. But now I am not sure if the benefits outweigh the costs of implementing such a thing.