vizier-scala
Eliminate direct dependencies on pyspark
Per #58 we're going to want to create multiple environments... but pyspark is a really heavyweight dependency, so we'll want to remove it if possible. Specific places where it shows up:
- `client.py:export_module` annotates functions with pyspark type annotations (#219 should eliminate this need).
- `client.py:get_data_frame` uses pyspark's arrow collector to connect python cells to pandas dataframes. We might be able to get away with a direct dependency on `arrow` itself (possibly something to fix along with #92); see the sketch after this list.
- `info.vizierdb.commands.python.PythonUDFBuilder`: `pyspark.cloudpickle` is used to serialize python functions for use with Spark.
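For the `get_data_frame` point, here's a rough sketch of what a pyspark-free path could look like, assuming we can already fetch the dataset as Arrow IPC stream bytes (how we fetch them is a separate question, and the helper name `arrow_stream_to_pandas` is made up for illustration):

```python
# Sketch: decode Arrow IPC stream bytes into a pandas DataFrame with pyarrow
# alone, no pyspark involved. Assumes `buf` already holds the bytes fetched
# from the Vizier server (hypothetical; the fetch itself isn't shown here).
import pyarrow as pa

def arrow_stream_to_pandas(buf: bytes):
    """Decode Arrow IPC stream bytes into a pandas DataFrame."""
    reader = pa.ipc.open_stream(pa.BufferReader(buf))
    table = reader.read_all()    # pyarrow.Table
    return table.to_pandas()     # needs pandas, but not pyspark
```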
The last point is going to be the major problem since, unfortunately, pyspark has its own version of cloudpickle, and cloudpickle generates version-specific output (Spark won't be able to read a function serialized by a different cloudpickle version). A few ideas:
- Insert our own shim layer, e.g., invoke python functions via ScalaPy. This has the advantage of
- Differentiate the *system* python from the cell-execution python. Invoke cloudpickle with the system python (i.e., don't run it in a venv).
The latter is probably what makes the most sense.
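To make that idea concrete, here's a very rough sketch (the interpreter path and helper names are placeholders, not anything that exists in the codebase yet):

```python
# Sketch: serialize a python function with the *system* python's
# pyspark.cloudpickle (outside any venv), so the pickle bytes match the
# cloudpickle version Spark will use to deserialize them.
import base64
import subprocess

SYSTEM_PYTHON = "/usr/bin/python3"  # placeholder: the interpreter pyspark is installed under

def pickle_with_system_python(source: str, func_name: str) -> bytes:
    """exec() `source` under the system python and cloudpickle the named function."""
    script = (
        "import base64, sys\n"
        "from pyspark import cloudpickle\n"
        "ns = {}\n"
        f"exec({source!r}, ns)\n"
        f"sys.stdout.write(base64.b64encode(cloudpickle.dumps(ns[{func_name!r}])).decode())\n"
    )
    result = subprocess.run(
        [SYSTEM_PYTHON, "-c", script],
        capture_output=True, text=True, check=True,
    )
    return base64.b64decode(result.stdout)
```

The obvious cost is that UDFs always get pickled by whatever cloudpickle version the system python carries, but that's essentially the constraint Spark imposes anyway.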