Spark Connect support for dagster
What's the use case?
Spark Connect is a client-server Spark architecture, where the client sends logical plans to a remote Spark server, and it's now used by some vendor runtimes, like Databricks Serverless.
There are some breaking changes with Spark Connect. For example, code that accesses the SparkContext (self.spark_session.sparkContext) will not work, because Spark Connect sessions don't expose a SparkContext.
It should be possible to restructure the code so that it works with both traditional Spark and Spark Connect.
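A minimal sketch of that kind of restructuring (not from the Dagster codebase; the helper name is made up): build DataFrames directly from the session instead of going through the SparkContext, which Connect sessions don't expose.

```python
from pyspark.sql import SparkSession

def make_dataframe(spark: SparkSession, rows, schema):
    # Classic-only path that breaks under Spark Connect:
    #   spark.sparkContext.parallelize(rows).toDF(schema)
    # Building the DataFrame from the session works in both modes.
    return spark.createDataFrame(rows, schema)

spark = SparkSession.builder.getOrCreate()
df = make_dataframe(spark, [("a", 1), ("b", 2)], ["key", "value"])
df.show()
```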
Ideas of implementation
You can install the Spark Connect client with pip install "pyspark[connect]" and see what breaks. You can also try Dagster on a serverless Databricks cluster to see whether it works as expected. It looks like Dagster accepts Spark RDDs, and RDDs aren't supported by Spark Connect, so that is one place the code is likely to break.
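One way to probe for this (a hedged sketch; is_spark_connect is not an existing Dagster or PySpark helper) is to check which module the active session comes from, and steer RDD-based code paths away from Connect sessions:

```python
from pyspark.sql import SparkSession

def is_spark_connect(spark: SparkSession) -> bool:
    # Connect sessions are implemented under pyspark.sql.connect.*, so checking
    # the session's module avoids importing the connect package when it isn't installed.
    return type(spark).__module__.startswith("pyspark.sql.connect")

spark = SparkSession.builder.getOrCreate()
if is_spark_connect(spark):
    print("Spark Connect session: RDD APIs (spark.sparkContext) are unavailable")
else:
    print("Classic session: RDD APIs are available")
```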
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
Cool to see that this already existed. The error we're currently facing with Spark Connect is that it uses a different DataFrame class (pyspark.sql.DataFrame as opposed to pyspark.sql.connect.DataFrame). This happens in Databricks when you use the latest runtimes on a shared cluster rather than a single-user one.
I do think it'd be nice if Spark could make these more intertwined, but having Dagster be able to handle both would be a good first step as well.
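Until that's handled upstream, here's a hedged sketch of what an accept-both check could look like (illustrative only, not Dagster's actual type handling):

```python
from pyspark.sql import DataFrame as ClassicDataFrame

try:
    # Only importable when the Spark Connect client (pyspark[connect]) is installed.
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
    SPARK_DATAFRAME_TYPES = (ClassicDataFrame, ConnectDataFrame)
except ImportError:
    SPARK_DATAFRAME_TYPES = (ClassicDataFrame,)

def is_spark_dataframe(obj) -> bool:
    # Accepts both the classic and the Connect DataFrame classes.
    return isinstance(obj, SPARK_DATAFRAME_TYPES)
```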
They will be! In Spark 4.0, all of the Spark dataframes (Connect, "classic", etc.) inherit from a common base class.
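If so, a single isinstance check against pyspark.sql.DataFrame should cover both flavors once Spark 4.0 is the minimum version (sketch under that assumption):

```python
from pyspark.sql import DataFrame

def is_spark_dataframe(obj) -> bool:
    # In Spark 4.0, classic and Connect DataFrames share pyspark.sql.DataFrame
    # as a common base class, so one check covers both.
    return isinstance(obj, DataFrame)
```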