
Spark Connect support for dagster

Open · MrPowers opened this issue 1 year ago · 1 comment

What's the use case?

Spark Connect is a different Spark architecture that's now used by some vendor runtimes, like Databricks Serverless.

Spark Connect introduces some breaking changes. For example, code that accesses the SparkContext (self.spark_session.sparkContext) will not work, because Connect sessions do not expose a driver-side SparkContext.

You should be able to restructure the code so that it works with both traditional Spark & Spark Connect.
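As a minimal sketch of what such a restructuring might look like (the helper names here are hypothetical, not Dagster APIs): prefer attributes that exist on the SparkSession itself over ones that require a SparkContext, since Connect sessions don't have one.

```python
# Connect-unsafe: goes through the driver-side SparkContext, which a
# Spark Connect session does not expose.
def spark_version_classic_only(spark) -> str:
    return spark.sparkContext.version


# Connect-safe: SparkSession.version exists on both classic and Connect
# sessions, so the same code path works under either architecture.
def spark_version(spark) -> str:
    return spark.version
```

The general pattern is the same for other SparkContext-only accesses: find the equivalent on the session (or the DataFrame API) and use that instead.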

Ideas of implementation

You can install the Spark Connect client with pip install "pyspark[connect]" and see what breaks. You can also try running Dagster on a serverless Databricks cluster to see whether it works as expected. It looks like Dagster accepts Spark RDDs, and RDDs aren't supported by Spark Connect, so that might be one place where the code breaks.
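One way to start that investigation is a small probe (a sketch, not Dagster code) that reports whether the Spark Connect client modules are even importable in the current environment, i.e. whether pyspark was installed with the connect extra:

```python
import importlib.util


def connect_available() -> bool:
    """True if the Spark Connect client package (pyspark.sql.connect)
    is importable, i.e. pyspark was installed with the `connect` extra."""
    try:
        return importlib.util.find_spec("pyspark.sql.connect") is not None
    except ModuleNotFoundError:
        # pyspark itself is not installed at all.
        return False
```

Running this inside the target environment (e.g. a Databricks serverless cluster) tells you which code paths are reachable before digging into Dagster's own handling.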

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

MrPowers avatar Feb 13 '24 19:02 MrPowers

Cool to see that this already existed. The error we're currently facing with Spark Connect is with a different class for DataFrame (pyspark.sql.DataFrame as opposed to pyspark.sql.connect.DataFrame). This happens in Databricks when you use the latest runtimes on a shared cluster rather than single-user.

I do think it'd be nice if Spark could make these more intertwined, but having Dagster be able to handle both would be a good first step as well.
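One way Dagster (or user code) could handle both classes today is to build the isinstance check from whichever DataFrame classes are actually importable. This is a sketch under the assumption that simple type dispatch is the failing spot; the helper names are hypothetical:

```python
def spark_dataframe_types() -> tuple:
    """Collect whichever DataFrame classes are importable in this
    environment: the classic pyspark.sql.DataFrame and, if present,
    the Spark Connect variant."""
    types = []
    try:
        from pyspark.sql import DataFrame
        types.append(DataFrame)
    except ImportError:
        pass
    try:
        from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
        types.append(ConnectDataFrame)
    except ImportError:
        pass
    return tuple(types)


def is_spark_dataframe(obj) -> bool:
    # isinstance accepts a tuple of classes; with an empty tuple it simply
    # returns False, so this degrades gracefully when pyspark is absent.
    return isinstance(obj, spark_dataframe_types())
```

The same tuple can be used anywhere a single hard-coded DataFrame class is checked today.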

matt-weingarten avatar Nov 22 '24 15:11 matt-weingarten

> Cool to see that this already existed. The error we're currently facing with Spark Connect is with a different class for DataFrame (pyspark.sql.DataFrame as opposed to pyspark.sql.connect.DataFrame). This happens in Databricks when you use the latest runtimes on a shared cluster rather than single-user.
>
> I do think it'd be nice if Spark could make these more intertwined, but having Dagster be able to handle both would be a good first step as well.

They will be! In Spark 4.0, all of the Spark dataframes (Connect, "classic", etc.) inherit from a common base class.
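With that Spark 4.0 change, the multi-class dispatch collapses to a single check, since pyspark.sql.DataFrame becomes the shared base class. A sketch (the guarded import is just so the helper degrades gracefully when pyspark isn't installed):

```python
def is_spark_dataframe(obj) -> bool:
    """On Spark 4.0+, pyspark.sql.DataFrame is the common base class for
    both classic and Connect dataframes, so one isinstance check covers
    both architectures."""
    try:
        from pyspark.sql import DataFrame
    except ImportError:
        return False
    return isinstance(obj, DataFrame)
```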

deepyaman avatar Mar 26 '25 03:03 deepyaman