Remove dependency of soda-core-spark-df on pyspark
`soda-core-spark-df` declares a dependency on `pyspark>=3.4.0`, but pyspark is used only for typing and should therefore be declared as an optional extra, to avoid unnecessary dependency conflicts.
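For context, a typing-only dependency like this can usually be kept out of the runtime requirements by guarding the import behind `TYPE_CHECKING`; the snippet below is a minimal sketch of that pattern, not the actual soda-core code:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers, so pyspark does not have to be
    # installed for this module to import at runtime.
    from pyspark.sql import DataFrame


def row_count(df: DataFrame) -> int:
    # The annotation stays a plain string at runtime thanks to
    # `from __future__ import annotations`.
    return df.count()
```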
Closes #2217
Quality Gate passed
- Issues: 0 new issues, 0 accepted issues
- Measures: 0 security hotspots, 0.0% coverage on new code, 0.0% duplication on new code
Thanks for the contrib @ghjklw. The pinned version ensures that the expected programmatic API (i.e. the classes, in this case) is present. The version we are pinning is two years old; do you think that is still too much to ask?
Hi @m1n0, I think that in this case the version requirement is perfectly reasonable, no issue with that 🙂
The specific challenge I'm facing is that the packages `pyspark` and `databricks-connect` both provide the `pyspark` module and are incompatible, as documented in https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/install#install-the-databricks-connect-client-with-venv and https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/troubleshooting#conflicting-pyspark-installations
To be fair, this is much more of a Databricks issue than a Soda issue, but it's also much easier to fix on this side 😉 In practice, when I override soda-core-spark-df's dependency on pyspark, it works flawlessly.
By "ensures that the expected programmatic api (i.e. classes in this cases) is present", do you mean:
-
when running unit/integration tests? In this case I believe my PR covers it as we still install PySpark to run the tests
-
at runtime? If that is the case, I think (personal opinion 😉) that the solution has stronger consequences than the risk it mitigates. The only PySpark methods used that I found in the code are:
SparkSession.sql,DataFrame.collect,DataFrame.count,DataFrame.offset,DataFrame.limit,DataFrame.schema.fields. Of these, onlyoffsetis somewhat new (version3.4.0).Do you think it might be a good enough solution to update the documentation to say to install
soda-spark-df[pyspark]unless you have an existingpyspark>=3.4.0installation? This would work for 99% of persons, but those who need the flexibility on how they manage their PySpark environment would still be able to do so. You could even add a specific note aboutdatabricks-connect.
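To make that concrete, here is a rough sketch of how such an extra might be declared; this is not soda-core's actual `setup.py`, and the dependency list is illustrative only:

```python
# Hypothetical excerpt of a setup.py for soda-core-spark-df; the real
# package metadata may look different.
from setuptools import setup

setup(
    name="soda-core-spark-df",
    install_requires=[
        # pyspark deliberately not listed here, so environments that get the
        # pyspark module from databricks-connect are not broken...
    ],
    extras_require={
        # ...while `pip install soda-core-spark-df[pyspark]` still gives the
        # current behaviour for everyone else.
        "pyspark": ["pyspark>=3.4.0"],
    },
)
```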
The `offset` method is what caused the pin to be `>= 3.4.0`. I don't think just having it in the docs would be a good-enough UX, especially stepping back from the stricter one that we have now. The pin was motivated by multiple users coming to us with `.cursor() is not defined` issues.
I would rather move forward and not back on this, so how about an extra that removes the pyspark dependency? Doing that and adding it to the docs would not break the existing user experience and would allow users in scenarios like yours to move forward without friction.
That would be a good option, but I'm not aware of any way to define an extra with "negative" dependencies.
There have been several discussions related to similar problems:
- Negative extras: https://discuss.python.org/t/the-extra-environment-marker-and-its-operators/4976
- Alternative dependencies (i.e. declaring that databricks-connect is a valid alternative to pyspark): https://discuss.python.org/t/require-any-of-several-alternative-package-dependencies/26293
- Default extras (i.e. putting pyspark in an extra group and declaring that this extra should be installed by default): https://peps.python.org/pep-0771/ This last draft PEP actually discusses the exact problem we are facing: "Recommended but not required dependencies".
It looks like this will be coming, but it will take some time.
There are only two options left that I can think of:
- A runtime check (see the sketch after this list), but this might not really address your concern about backward compatibility.
- Defining a separate package, e.g. soda-core-spark-df vs soda-spark-df, which adds the pyspark dependency. However, this is quite a heavy solution...
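As an illustration of the first option, a minimal sketch of what such a runtime check could look like; the helper name and error messages are hypothetical, not existing soda-core code:

```python
# Hypothetical helper; soda-core does not currently ship anything like this.
def check_pyspark_available() -> None:
    try:
        # The module can come from either pyspark or databricks-connect.
        from pyspark.sql import DataFrame
    except ImportError as exc:
        raise RuntimeError(
            "soda-core-spark-df needs the pyspark module at runtime. "
            "Install pyspark>=3.4.0, or databricks-connect, which also "
            "provides it."
        ) from exc
    # Feature check rather than version check: DataFrame.offset is the call
    # that motivated the >=3.4.0 pin, so its absence means the installed
    # PySpark is too old.
    if not hasattr(DataFrame, "offset"):
        raise RuntimeError(
            "The installed PySpark does not provide DataFrame.offset, "
            "which soda-core-spark-df relies on."
        )
```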
Hi, just in case it adds something to the discussion: I had some issues with local tests when upgrading from soda-core-spark-df 3.3.5 to 3.3.6, because of the `offset` method. If I am not wrong, `offset` has been available in Scala in Apache Spark since 3.4.0 (as mentioned in the docs), but it was only added to pyspark in the 3.5.0 release, so executing it on earlier versions fails despite the 3.4.0 pyspark requirement.
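To make that failure mode concrete, here is a minimal reproduction under the assumption described in the comment above (classic PySpark 3.4.x installed, local Spark session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.range(10)

# Satisfies the pyspark>=3.4.0 pin, yet on a classic PySpark 3.4.x install
# this line raises AttributeError, because (per the comment above)
# DataFrame.offset only landed in the Python API with the 3.5.0 release.
print(df.offset(5).limit(3).collect())
```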