
[Enh]: Add Support For PySpark

Open ELC opened this issue 1 year ago • 12 comments

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

No response

Please describe the purpose of the new feature or describe the problem to solve.

PySpark is one of the most used dataframe processing frameworks in the big data space. There are whole companies built around it and it is commonplace in the Data Engineering realm.

One common pain point is that data scientists usually work with Pandas (or, more recently, Polars), and when their code is integrated into big ETL processes, it is usually converted to PySpark for efficiency and scalability.

I believe that is precisely the problem Narwhals tries to solve and it would be a great addition to the data ecosystem to support PySpark.

Suggest a solution if possible.

PySpark has two distinct APIs:

  • PySpark SQL - identical API to the Scala version of Spark
  • PySpark Pandas - a subset of the Pandas API (previously known as Koalas)

Given that PySpark Pandas has an API based on Pandas, I believe it should be relatively straightforward to re-use the code already written for the Pandas backend.

There is a PySpark SQL to PySpark Pandas conversion, so in theory it should be possible to also add ad hoc support for PySpark SQL DataFrames and measure the overhead. If the overhead is too big, a separate backend for that API could be considered.
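For reference, the PySpark SQL to PySpark Pandas conversion mentioned above goes through public APIs; a minimal sketch (assuming a local SparkSession, with made-up column names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["x", "y"])

psdf = sdf.pandas_api()      # PySpark SQL DataFrame -> pandas-on-Spark
sdf_again = psdf.to_spark()  # and back again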

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

I do have experience with the PySpark API and would like to contribute. I read the "How it works" section, but would like some concrete direction on how to get started, and to know whether this is of interest to the maintainers.

ELC avatar Jun 24 '24 02:06 ELC

Thanks @ELC for your request! Yup, this is definitely in scope and of interest!

MarcoGorelli avatar Jun 24 '24 07:06 MarcoGorelli

I was literally coming on this channel to ask the same question - love it!! And would also be interested in contributing.

jahall avatar Jun 26 '24 10:06 jahall

Fantastic, thank you! I did use pyspark in a project back in 2019, but I think I've forgotten most of it by now 😄 From what I remember, it's a bit difficult to set up? Is there an easy way to set it up locally so we can use it for testing, to check that things work?

PySpark Pandas

It might be easiest to just start with this, to be honest. Then, once we've got it working, we can remove a layer of indirection.

In terms of contributing, I think if you take a look at narwhals._pandas_like, that's where the implementation for the pandas APIs is. There's an _implementation field which keeps track of which pandas-like library it is (cudf, modin, pandas). Maybe it's as simple as just adding pyspark to the list?
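For orientation, a hedged sketch of that kind of dispatch - the names below are illustrative, not the actual narwhals internals:

def infer_implementation(native_frame):
    # Dispatch on the root module of the native object's type,
    # e.g. "pandas", "modin", "cudf" - and, potentially, "pyspark"
    root = type(native_frame).__module__.split(".", 1)[0]
    if root in {"pandas", "modin", "cudf", "pyspark"}:
        return root
    raise TypeError(f"Unsupported frame type: {type(native_frame)}")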

MarcoGorelli avatar Jun 26 '24 11:06 MarcoGorelli

I have experience working with Azure Pipelines agents, which I believe are the same VM agents that run on GitHub Actions. They come with all the relevant Java dependencies pre-installed, so having the tests run on CI/CD should not be a problem.

As per local development, there are a couple of options:

  • Setting up PySpark locally - really troublesome
  • Using VS Code with Dev Containers - requires a Docker Desktop installation
  • Using GitHub Codespaces with a PySpark image - easiest, but relies on a working internet connection

For this contribution I will go with the third option, as it is the fastest and easiest to set up. If you would like me to set up the necessary files for the second one, I can do that too in a separate issue.

I will have a look at _pandas_like and _implementation to see what's needed, and will keep you posted on the progress.

ELC avatar Jun 26 '24 23:06 ELC

Hey folks, I have been working on this feature on a local branch. I have a few questions:

  • What is the best way to gradually develop and implement this?

Adding support for a whole DataFrame library is quite a big feature, and it is a little all or nothing. I think it would be useful if any changes/choices being made are picked up early on.

Have we considered the following for pyspark pandas?

  • pyspark pandas converts to pyarrow, then pandas. There is quite often type confusion somewhere in the process. There is also a significant, but often overlooked, overhead of these conversions. Performance is generally much better with straight pyspark.
  • pyspark pandas doesn't work well with the latest versions of pandas; it is particularly unstable with pandas>2. I have had to limit the pandas version in dev-requirements on my local branch to pandas<2.1.0, because pyspark pandas still relies on behaviour deprecated in pandas 2.1.0.

The local branch I have been working on uses straight pyspark. It would also be possible to add pyspark pandas as its own pandas-like implementation AND the straight pyspark API, and give users the choice.

This is perhaps something to be explored in a later feature, but if we add support for the pyspark API, we could possibly have a route to much more library support through sqlframe. I would prefer to implement pyspark over pyspark pandas whether or not sqlframe is an option for narwhals.

I am not developing using pyspark pandas, but I still had to limit my pandas version to <2.1.0 on my local branch. Why is this?

Following the lead of the Dask LazyFrame, I have been implementing the collect method as returning a pandas-backed PandasLike DataFrame. Spark then still needs to interface with pandas, but that brings back the deprecation issue with pandas>=2.1.0.

I think there might be a better way. The PySpark 4.0.0 preview adds a toArrow method. The collect -> eager implementation could then return an Arrow-backed DataFrame. There is only one conversion here (pyspark -> arrow), rather than going through pandas first (pyspark -> pandas -> arrow). Ritchie Vink has made a related comment in the past about minimising these conversions, back when there was no official way of going to Arrow from Spark. The challenge is that this would make at least the eager implementation in narwhals incompatible with pyspark<4.0.0 (once Spark 4.0.0 is actually released). Any thoughts on this?
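A hedged sketch of the two collection paths being compared (toArrow assumes pyspark>=4.0.0; toPandas works on earlier versions; column names are made up):

import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["x", "y"])

# Single conversion, pyspark>=4.0.0 only: Spark -> Arrow
arrow_table: pa.Table = sdf.toArrow()

# Older versions have to go through pandas instead: Spark -> pandas
pdf = sdf.toPandas()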

TomBurdge avatar Aug 21 '24 06:08 TomBurdge

hey @ELC I usually use SDKMAN when developing pyspark locally, and it works pretty nicely in my experience (on WSL, set up on a few different laptops). It seems like a lot of pyspark developers aren't aware of this approach, because the tool is mostly used by Java developers, and plenty of pyspark developers don't know Java (at least I don't ☺️).

SDKMAN allows you to, at least the way I think about it, create the Java equivalent of a virtual environment for your repo via a .sdkmanrc file at the root. Right now I have the following:

java=21.0.4-tem
spark=3.5.1
hadoop=3.3.5

along with adding pyspark to dev-requirements.txt, I was good to go. I do agree with you that some approach which uses containerization is likely to work best for most people.
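If it helps anyone else: with SDKMAN installed, running sdk env install from the repo root installs the versions pinned in .sdkmanrc, and sdk env activates them in the current shell.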

TomBurdge avatar Aug 21 '24 06:08 TomBurdge

Nice one, thanks @TomBurdge ! would be great to have you on board as contributor here!

I think converting to Arrow might make sense in collect. If anyone wants to support earlier versions of spark and convert to pandas, they can always do that manually with something like

import narwhals as nw

# is_spark_lazyframe is a hypothetical user-defined check for a
# Spark-backed LazyFrame - narwhals doesn't ship a helper under this name
if is_spark_lazyframe(df):
    df = nw.from_native(nw.to_native(df).toPandas(), eager_only=True)
else:
    df = df.collect()

But I think that data science libraries which explicitly support Spark are quite rare, unfortunately - well, fortunately for us, as it means that backwards-compatibility is less of a concern, and so they likely wouldn't mind only supporting Spark 4.0+ too much 😄

MarcoGorelli avatar Aug 21 '24 07:08 MarcoGorelli

Thanks @MarcoGorelli, yeah, keen to get involved, spare time permitting! I will try to make it to one of the community calls.

Question on intended behaviour:

For the rename method, the relevant Spark function is withColumnRenamed. It has a major gotcha: if the original column is not present, it is a no-op (mentioned in the docs). This behaviour has caused me a lot of frustration. 🙃
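To illustrate the no-op (hypothetical column names, assuming a local SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ["a"])

# No column "b" exists, yet no error is raised - the rename silently no-ops:
df.withColumnRenamed("b", "c").columns  # ['a']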

What is narwhals' intended behaviour? Strict? The narwhals tests in tests/frame/rename_test.py aren't currently opinionated on strict vs non-strict.

TomBurdge avatar Aug 26 '24 13:08 TomBurdge

ah interesting, thanks @TomBurdge! I think that, as Narwhals is mainly aimed at tool-builders, we'd be better off using the strict behaviour
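For illustration, a strict wrapper over withColumnRenamed could look like this (a hedged sketch; strict_rename is a hypothetical helper, not narwhals code):

def strict_rename(df, mapping):
    # Raise on missing source columns instead of silently no-oping
    missing = set(mapping) - set(df.columns)
    if missing:
        raise ValueError(f"Columns not found: {sorted(missing)}")
    for old, new in mapping.items():
        df = df.withColumnRenamed(old, new)
    return df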

MarcoGorelli avatar Aug 26 '24 16:08 MarcoGorelli

Hi @TomBurdge, I completely missed these last replies, and I have started implementing something for this. Do you also have something implemented? We can collaborate on this and merge what we have :)

I think we should aim for a "minimal" implementation now and then have follow-up PRs for the rest of the methods.

EdAbati avatar Sep 03 '24 21:09 EdAbati

Just letting you know that I'm following this with great interest and could potentially be helpful with testing once you have something ready.

morrowrasmus avatar Sep 16 '24 14:09 morrowrasmus

I am new to Spark; just for fun, I tried it locally with: https://github.com/jplane/pyspark-devcontainer

mattcristal avatar Sep 26 '24 17:09 mattcristal

Given that DuckDB also has a PySpark API, maybe this would give us DuckDB support almost for free (we could make a _spark_like folder, like the current _pandas_like one): https://duckdb.org/docs/api/python/spark_api.html
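For reference, DuckDB's experimental Spark API mirrors the PySpark entry points; a minimal sketch based on the linked docs:

import pandas as pd
from duckdb.experimental.spark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame({"x": [1, 2], "y": ["a", "b"]}))
df.show()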

MarcoGorelli avatar Nov 17 '24 07:11 MarcoGorelli

hey everyone - came to this thread after trying to run nw.from_native(data) on a PySpark Pandas DataFrame and saw the TypeError hinting that it's not supported. double checked that:

  • none of the checks in _from_native_impl would catch this either - neither is_native_pandas_like nor is_native_spark_like
  • couldn't find any open/closed PRs regarding PySpark Pandas

just wanted to check if anyone had an update on this? no rush - happy to help contribute!

lucas-nelson-uiuc avatar May 08 '25 18:05 lucas-nelson-uiuc

hey @lucas-nelson-uiuc - could you post the full traceback please?

MarcoGorelli avatar May 08 '25 18:05 MarcoGorelli

PySpark Pandas DataFrame

ah pyspark-pandas

I'd say that that warrants a separate issue, could you open a new one please?

MarcoGorelli avatar May 08 '25 18:05 MarcoGorelli

hey @lucas-nelson-uiuc - could you post the full traceback please?

i swear i know how to copy-paste - couldn't send from work laptop or recreate this on my own in Colab :')

[Image: screenshot of the traceback]

lucas-nelson-uiuc avatar May 08 '25 18:05 lucas-nelson-uiuc