narwhals feat: Add minimal Pyspark support

What type of PR is this? (check all applicable)

[ ] 💾 Refactor
[x] ✨ Feature
[ ] 🐛 Bug Fix
[ ] 🔧 Optimization
[ ] 📝 Documentation
[ ] ✅ Test
[ ] 🐳 Other

Related issues

Closes #333

Checklist

[x] Code follows style guide (ruff)
[ ] Tests added
[ ] Documented the changes

If you have comments or can explain your changes, please do so below.

As mentioned in the latest call, I've started working on the support for PySpark.

The goal of this PR would be to have a minimal initial implementation as a starting point. As we did for Dask, we can implement single methods in following PRs!

⚠️ this is not ready for review, a lot of test are failing, the code is ugly. :) Just making the PR for visibility and to have a place to comment/ask questions on specific points

Sep 03 '24 21:09 EdAbati

This PR diff is getting big because of all the xfail in tests. 😕 @MarcoGorelli @FBruzzesi do you have a better idea on how to make it more "reviewable"? or do you think it is fine?

Sep 12 '24 06:09 EdAbati

This PR diff is getting big because of all the xfail in tests. 😕 @MarcoGorelli @FBruzzesi do you have a better idea on how to make it more "reviewable"? or do you think it is fine?

For dask we started with it's own test file, so that we didn't have to modify every other file. Once we had a few methods implemented we shifted the constructor into the conftest list of constructors and added the xfails.

Would that be a good strategy again?

Sep 12 '24 06:09 FBruzzesi

This is finally ready for review 🥵

I have to fix the tests on windows (I think I may need to set up Java or something similar) and the test with old version of pandas (that is not compatible with PySpark) But the rest should be ready!

I tried my best to keep it as small as possible while trying to implement the main functionality. Let me know what you think

Oct 13 '24 17:10 EdAbati

@MarcoGorelli @FBruzzesi which criteria did we use to decide the minimal supported versions? Popularity? Time of release?

Oct 14 '24 10:10 EdAbati

i'd suggest, the lowest one that's not too difficult to support 😄 I think it'd be OK to set it quite high here, we can always work on lowering it later if there's demand

Oct 14 '24 17:10 MarcoGorelli

Because of pyspark current requirements, trying to make all the tests pass is a bit tricky.

I decided to make 3.3.0 the minimum dependency, the most recent version is 3.5.2. None of these is technically compatible* with Python 3.12. Is there a way to exclude pyspark from 3.12 tests without affecting coverage?

* technically compatible: Ubuntu tests on 3.12 seem fine, Windows complains. But Python 3.12 will be officially supported starting from 4.0.0: https://issues.apache.org/jira/browse/SPARK-44120

Oct 15 '24 20:10 EdAbati

looks like there's some mysterious error in CI

ERROR tests/spark_test.py::test_std[_pyspark_constructor_with_session] - pytest.PytestUnraisableExceptionWarning: Exception ignored in: <function __install_postfork_hook.<locals>.before_hook at 0x7f32678c40d0>

If we can get CI green, I think it'd be OK to merge this so we can then incrementally improve and avoid the PR going stale? Well done sticking with this one 💪

Nov 18 '24 10:11 MarcoGorelli

I think it is because of a new warning introduced yesterday in polars==1.14 source. I need to check if we import polars in any code triggered by the spark workers (or similar)

Update: it seems to be triggered by the SparkSession because it is always raised in the 1st test (and test order is random). I'm trying to find a way to catch it so it doesn't need to be ignored for all tests

Nov 18 '24 17:11 EdAbati

Tests are green (a part from an unrelated one)

Please let me know if you or @FBruzzesi have any other questions or ideas how to improve this :) I think I addressed everything, some questions are still open

Nov 21 '24 07:11 EdAbati

amazing, thanks for updating! happy to merge once it's green as 🥦

Dec 03 '24 21:12 MarcoGorelli

Thank you all for the feedback and apologies this is taking longer but I am a bit busy.

The CI is almost green 😅 I need to add some if-else because pyspark is not available in python>3.11. Hopefully it will be fixed tonight

regarding the issue with df.with_columns(d=nw.col('a').mean()).collect().to_native() I may have an idea on how to do it. I try to squeeze this is in too (and add a test). If this takes longer, could it be a follow-up?

Dec 04 '24 12:12 EdAbati

If this takes longer, could it be a follow-up?

yup definitely!

Dec 04 '24 13:12 MarcoGorelli

Ok CI is 🌲🌳

Regarding that issue, I am still working on a fix 🥲

Dec 05 '24 07:12 EdAbati

Exciting 🎉🎉🎉 thanks!

Dec 05 '24 08:12 EdAbati