
Eliminate direct dependencies on pyspark

Open · okennedy opened this issue 2 years ago · 1 comment

Per #58, we're going to want to create multiple environments... but pyspark is a heavyweight dependency, and we should remove it if possible. Specific places where it shows up:

  • client.py: export_module annotates functions with pyspark type annotations (#219 should eliminate this need)
  • client.py: get_data_frame uses pyspark's arrow collector to connect python cells to pandas dataframes. We might be able to get away with a direct dependency on arrow itself. (Possibly something to fix along with #92 )
  • info.vizierdb.commands.python.PythonUDFBuilder: pyspark.cloudpickle is used to serialize python functions for use with Spark.

The last point is going to be the major problem: unfortunately, pyspark vendors its own copy of cloudpickle, and cloudpickle generates version-specific output (Spark won't be able to read a function serialized by a different cloudpickle version). A few ideas:

  • Insert our own shim layer, e.g., invoke python functions via ScalaPy. This has the advantage of sidestepping cloudpickle serialization entirely.
  • Differentiate the system python from the cell execution python. Invoke cloudpickle with the system python (i.e., don't run it in a venv).

The latter is probably what makes the most sense.
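A rough sketch of that second idea follows. To keep it self-contained, `marshal` stands in for `pyspark.cloudpickle` (both emit interpreter-version-specific bytes that must be read by a matching interpreter), and `sys.executable` stands in for the system python; every name here is an illustrative assumption, not Vizier's actual code.

```python
# Sketch: run the serializer under a designated interpreter (the "system"
# python where Spark's cloudpickle lives), even when the user's cell runs
# inside a venv. marshal stands in for pyspark.cloudpickle here so the
# sketch has no third-party dependencies; the names are assumptions.
import base64
import marshal
import subprocess
import sys
import types

# In Vizier this would point at the interpreter with pyspark installed;
# sys.executable keeps the sketch runnable anywhere. (Assumption.)
SYSTEM_PYTHON = sys.executable

SERIALIZER = r"""
import base64, marshal, sys
ns = {}
exec(sys.stdin.read(), ns)          # define the cell's function
payload = marshal.dumps(ns["udf"].__code__)
sys.stdout.write(base64.b64encode(payload).decode("ascii"))
"""

def serialize_udf(source: str) -> bytes:
    """Serialize the function named `udf` in `source` using SYSTEM_PYTHON."""
    proc = subprocess.run(
        [SYSTEM_PYTHON, "-c", SERIALIZER],
        input=source, capture_output=True, text=True, check=True,
    )
    return base64.b64decode(proc.stdout)

def deserialize_udf(payload: bytes):
    """Rebuild the function; only valid under a matching interpreter."""
    return types.FunctionType(marshal.loads(payload), {})

payload = serialize_udf("def udf(x):\n    return x + 1\n")
assert deserialize_udf(payload)(41) == 42
```

The key property is that the venv python never touches the version-sensitive serializer: cell source goes to the system python over a pipe, and only opaque bytes come back.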

— okennedy, Jul 10 '22 17:07