PyCall.jl

pyjlwrap support for Pickle Serialization

Open dmoliveira opened this issue 4 years ago • 5 comments

PyCall works nicely for many use cases between Python and Julia. One case, however, could be improved, and it is very important for the data science community. For example, I tried to use PyCall with the PySpark library, and it works very well for the basic use case. But if the user needs to create a UDF (user-defined function), they will have trouble serializing the function. UDFs would let many data scientists reuse Julia code and call Spark to do the heavy work. Enabling this would broaden the use of Julia in different scenarios.

To solve the current issue with UDFs, PyObject needs to be serializable with pickle. I don't have much of an idea of how to solve this, but here is a simple use case that, if fixed, would be a step toward this functionality:

Example:

using PyCall
pickle = pyimport("pickle")
pickle.dumps(x -> x + 1)

Error:

ERROR: PyError ($(Expr(:escape, :(ccall(#= /root/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'TypeError'>
TypeError("cannot pickle 'PyCall.jlwrap' object")

Stacktrace:
 [1] pyerr_check at /root/.julia/packages/PyCall/zqDXB/src/exception.jl:60 [inlined]
 [2] pyerr_check at /root/.julia/packages/PyCall/zqDXB/src/exception.jl:64 [inlined]
 [3] _handle_error(::String) at /root/.julia/packages/PyCall/zqDXB/src/exception.jl:81
 [4] macro expansion at /root/.julia/packages/PyCall/zqDXB/src/exception.jl:95 [inlined]
 [5] #110 at /root/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:43 [inlined]
 [6] disable_sigint at ./c.jl:446 [inlined]
 [7] __pycall! at /root/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:42 [inlined]
 [8] _pycall!(::PyObject, ::PyObject, ::Tuple{var"#3#4"}, ::Int64, ::Ptr{Nothing}) at /root/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:29
 [9] _pycall!(::PyObject, ::PyObject, ::Tuple{var"#3#4"}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:11
 [10] (::PyObject)(::Function; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:86
 [11] (::PyObject)(::Function) at /root/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:86
 [12] top-level scope at REPL[13]:1

Reference to UDF in Python: https://docs.databricks.com/spark/latest/spark-sql/udf-python.html

Nov 14 '20 04:11 dmoliveira

In other words, you want to serialize Julia objects (wrapped in Python objects) via Pickle.

I guess we could do this by embedding the Julia serialization format (via the Serialization stdlib) in pickle?

Nov 16 '20 15:11 stevengj

Exactly, @stevengj. How can we accomplish this? Could you provide some guidance, please?

Nov 16 '20 21:11 dmoliveira

I think it involves overloading __getstate__ and __setstate__ (https://docs.python.org/3/library/pickle.html#object.getstate), but I would have to do a bit of reading on pickle and how it interacts with the C API.
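
For illustration only, here is a minimal plain-Python sketch of how __getstate__ and __setstate__ behave. It is not PyCall code: the class OpaqueWrapper, its payload bytes, and the memoryview "handle" are hypothetical stand-ins for a wrapper around a Julia object whose live handle cannot be pickled directly.

```python
import pickle


class OpaqueWrapper:
    """Hypothetical wrapper around a foreign object (an assumption, not PyCall's jlwrap)."""

    def __init__(self, payload: bytes):
        self.payload = payload              # serialized form of the foreign object
        self.handle = self._open(payload)   # live handle, rebuilt on load

    @staticmethod
    def _open(payload: bytes):
        # Stand-in for re-creating a live native object from its bytes.
        return memoryview(payload)

    def __getstate__(self):
        # Only the serialized bytes go into the pickle; the live handle is dropped.
        return {"payload": self.payload}

    def __setstate__(self, state):
        # Called on unpickling instead of __init__; rebuild the handle here.
        self.payload = state["payload"]
        self.handle = self._open(self.payload)


original = OpaqueWrapper(b"serialized-julia-object")
restored = pickle.loads(pickle.dumps(original))
```

Note that this route presumably requires the wrapper type to define these methods on the Python side, which is where the C API question comes in for a type created by PyCall.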

Nov 16 '20 22:11 stevengj

Or rather, we probably want the lower-level __reduce__ interface (https://docs.python.org/3/library/pickle.html#object.reduce), which is more error-prone but will give us more control.
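
A rough plain-Python sketch of the __reduce__ approach, under stated assumptions: ForeignFunction and _rebuild are hypothetical names, and the payload bytes stand in for whatever Julia's Serialization stdlib would produce. Nothing here is PyCall's actual implementation.

```python
import pickle


def _rebuild(payload: bytes):
    """Module-level reconstructor (hypothetical); pickle stores a reference to it by name."""
    return ForeignFunction(payload)


class ForeignFunction:
    """Hypothetical stand-in for a Python object wrapping a Julia function."""

    def __init__(self, payload: bytes):
        # In a real integration this would be the Julia-serialized bytes (assumption).
        self.payload = payload

    def __reduce__(self):
        # Return (callable, args): on load, pickle calls _rebuild(payload).
        return (_rebuild, (self.payload,))


blob = pickle.dumps(ForeignFunction(b"\x00opaque bytes\x01"))
clone = pickle.loads(blob)
```

The extra control comes from choosing the reconstructor: it could, in principle, hand the bytes back to Julia for deserialization. The cost is that the reconstructor must be importable by name in the process that unpickles, which matters for distributed workers such as Spark executors.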

Nov 16 '20 22:11 stevengj

Great, @stevengj. If we can overcome this, it would be a huge step for the Julia community, and I would be glad to publish an article showing this awesome new feature!

Nov 16 '20 22:11 dmoliveira