
Remove dependency of soda-core-spark-df on pyspark

Open ghjklw opened this issue 1 year ago • 6 comments

Soda-core-spark-df declares a dependency on pyspark>=3.4.0, but pyspark is used only for typing, and should therefore be declared as an optional extra to avoid unnecessary dependency conflicts.

Closes #2217

ghjklw avatar Mar 08 '25 01:03 ghjklw

Thanks for the contrib @ghjklw. The pinned version ensures that the expected programmatic API (i.e. classes, in this case) is present. The version we are pinning is 2 years old; do you think that is still too much to ask?

m1n0 avatar Mar 17 '25 12:03 m1n0

Hi @m1n0, I think that in this case the version requirement is perfectly reasonable, no issue with that 🙂

The specific challenge I'm facing is that the packages pyspark and databricks-connect both provide the module pyspark and are incompatible, as documented in https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/install#install-the-databricks-connect-client-with-venv and https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/troubleshooting#conflicting-pyspark-installations

To be fair, this is much more of a Databricks issue than a Soda issue, but it's also much easier to fix on this side 😉 In practice, when I override soda-core-spark-df dependency on pyspark, it works flawlessly.

By "ensures that the expected programmatic API (i.e. classes, in this case) is present", do you mean:

  • when running unit/integration tests? In this case I believe my PR covers it, as we still install PySpark to run the tests

  • at runtime? If that is the case, I think (personal opinion 😉) that this solution has stronger consequences than the risk it mitigates. The only PySpark methods I found used in the code are: SparkSession.sql, DataFrame.collect, DataFrame.count, DataFrame.offset, DataFrame.limit, and DataFrame.schema.fields. Of these, only offset is somewhat new (version 3.4.0).

    Do you think it might be a good enough solution to update the documentation to say to install soda-spark-df[pyspark] unless you already have a pyspark>=3.4.0 installation? This would work for 99% of users, but those who need flexibility in how they manage their PySpark environment would still be able to do so. You could even add a specific note about databricks-connect.
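As a rough illustration of this suggestion, the packaging metadata could move pyspark into an optional extra. This is a hypothetical PEP 621 pyproject.toml fragment, not the project's actual build configuration (which may use a different build system or dependency list):

```toml
# Hypothetical fragment: pyspark is no longer a hard dependency.
[project]
name = "soda-core-spark-df"
dependencies = [
    "soda-core-spark",
]

[project.optional-dependencies]
# Most users: pip install "soda-core-spark-df[pyspark]"
# Users managing pyspark themselves (e.g. databricks-connect) skip the extra.
pyspark = ["pyspark>=3.4.0"]
```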

ghjklw avatar Mar 17 '25 13:03 ghjklw

The offset method is what caused the pin to be >= 3.4.0. I don't think just having it in docs would be a good-enough UX, especially stepping back from the stricter one we have now. The pin was motivated by multiple users coming to us with .cursor() is not defined issues.

I would rather move forward and not back on this, how about an extra that removes the pyspark dependency? Doing that and adding it to docs would not break the existing user experience and would allow users in scenarios like yours to move forward without friction.

m1n0 avatar Mar 18 '25 13:03 m1n0

That would be a good option, but I'm not aware of any way to define an extra with "negative" dependencies.

There have been several discussions related to similar problems:

  • Negative extras: https://discuss.python.org/t/the-extra-environment-marker-and-its-operators/4976
  • Alternative dependencies (i.e. declaring that databricks-connect is a valid alternative to pyspark): https://discuss.python.org/t/require-any-of-several-alternative-package-dependencies/26293
  • Default extras (i.e. putting pyspark in an extra group and declaring that this extra should be installed by default): https://peps.python.org/pep-0771/. This last draft PEP actually discusses the exact problem we are facing: "Recommended but not required dependencies"

It looks like a solution may eventually come via PEP 771, but it will take some time.

There are only two options left I can think of:

  • A runtime check, but this might not really address your concern about backward compatibility.
  • Defining a separate package, e.g. soda-core-spark-df vs. soda-spark-df, where the latter adds the pyspark dependency. However, this is quite a heavy solution...
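The runtime-check option mentioned above could be sketched roughly as follows. This is a hypothetical helper (require_module is not part of soda-core): instead of pinning pyspark at install time, fail fast with an actionable message when the module is missing or too old.

```python
import importlib


def require_module(name: str, minimum: tuple[int, int]) -> None:
    """Hypothetical runtime guard: raise a helpful error if `name` is not
    importable or its version is below `minimum` (major, minor)."""
    try:
        mod = importlib.import_module(name)
    except ImportError as exc:
        raise RuntimeError(
            f"{name} is required at runtime but is not installed; "
            f"install it directly, or e.g. pip install 'soda-core-spark-df[pyspark]'"
        ) from exc
    # Compare only major.minor; modules without __version__ are treated as 0.0.
    found = tuple(int(p) for p in getattr(mod, "__version__", "0.0").split(".")[:2])
    if found < minimum:
        raise RuntimeError(
            f"{name}>={minimum[0]}.{minimum[1]} is required, "
            f"found {getattr(mod, '__version__', 'unknown')}"
        )
```

This keeps databricks-connect users unblocked (their environment also provides the pyspark module), while preserving the clear error message the pin was introduced for.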

ghjklw avatar Mar 19 '25 06:03 ghjklw

> The offset method is what caused the pin to be >= 3.4.0. I don't think just having it in docs would be a good-enough UX, especially stepping back from the stricter one we have now. The pin was motivated by multiple users coming to us with .cursor() is not defined issues.
>
> I would rather move forward and not back on this, how about an extra that removes the pyspark dependency? Doing that and adding it to docs would not break the existing user experience and would allow users in scenarios like yours to move forward without friction.

Hi, just in case it adds something to the discussion: I had some issues with local tests when upgrading from soda-core-spark-df 3.3.5 to 3.3.6, because of the offset method. If I am not wrong, the offset method has been available in Scala in Apache Spark since 3.4.0 (as mentioned in the docs), but it was only added to pyspark in the 3.5.0 release, so executing it on earlier versions fails despite the pyspark>=3.4.0 requirement.
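If the 3.5.0 detail above is correct, a version gate would be one way to avoid calling DataFrame.offset on installations that lack it. This is a hypothetical sketch (supports_dataframe_offset is not soda-core code), taking the installed pyspark version string as input:

```python
def supports_dataframe_offset(pyspark_version: str) -> bool:
    """Per the comment above (an assumption, not verified here), DataFrame.offset
    only landed in PySpark 3.5.0, even though Scala Spark had it from 3.4.0.
    Compares only the major.minor components of the version string."""
    major, minor = (int(p) for p in pyspark_version.split(".")[:2])
    return (major, minor) >= (3, 5)
```

A caller could check this (e.g. against pyspark.__version__) and fall back to a limit-based pagination strategy when offset is unavailable.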

migueldoblado avatar May 25 '25 17:05 migueldoblado