Pandas compatibility

Open MilesCranmer opened this issue 1 year ago • 6 comments

Affects: PythonCall

Describe the bug

I have been trying to use pandas from PythonCall.jl and wanted to document a few calls that do not translate directly to Julia. This might just mean we need a PythonPandas package to translate calls, but I wonder whether there are any missing methods that could be implemented to fix things automatically.

First, the preamble for this:

using PythonCall

pd = pyimport("pandas")

  • [ ] 1. Constructing pandas.DataFrame

Using a similar syntax to Python:

df = pd.DataFrame(Dict([
    "a" => [1, 2, 3],
    "b" => [4, 5, 6]
]))

which results in the following dataframe:

julia> df
Python:
   0
0  b
1  a

That is, it has a single column named "0" whose rows are the keys "b" and "a"; the values are lost.

If I instead write this as a vector of pairs, I get:

julia> pd.DataFrame([
           "a" => [1, 2, 3],
           "b" => [4, 5, 6]
       ])
Python:
   0          1
0  a  [1, 2, 3]
1  b  [4, 5, 6]

I suppose this one makes sense.

I was able to get it working with the following syntax instead:

julia> df = pd.DataFrame([
            1   4
            2   5
            3   6
       ], columns=["a", "b"])
Python:
   a  b
0  1  4
1  2  5
2  3  6

  • [ ] 2. Selecting multiple columns

So, selecting a single column works:

julia> df["a"]
Python:
0    1
1    2
2    3
Name: a, dtype: int64

but selecting multiple columns does not:

julia> df[["a", "b"]]
ERROR: Python: TypeError: Julia: MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.
Python stacktrace:
 [1] __call__
   @ ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:223
 [2] apply_if_callable
   @ pandas.core.common ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/common.py:384
 [3] __getitem__
   @ pandas.core.frame ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/frame.py:4065
Stacktrace:
 [1] pythrow()
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
 [2] errcheck
   @ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
 [3] pygetitem(x::Py, k::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:171
 [4] getindex(x::Py, i::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/Py.jl:292
 [5] top-level scope
   @ REPL[18]:1

I got around this by inserting a pylist call:

julia> df[pylist(["a", "b"])]
Python:
   a  b
0  1  4
1  2  5
2  3  6

MilesCranmer avatar May 18 '24 17:05 MilesCranmer

As the documentation describes, AbstractArray and AbstractDict values are implicitly converted to wrapper objects when passed to a Python call.

In the first case, you should use the pydict function to convert the Julia Dict to a Python dict.

julia> df = pd.DataFrame(pydict(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])))
Python:
   b  a
0  4  1
1  5  2
2  6  3

As in the first case, an explicit call to the pylist function is required in the second case.

mrkn avatar May 21 '24 01:05 mrkn

Thanks, that makes sense! I didn’t see pydict.

So should this be closed or is there anything that can be done automatically?

MilesCranmer avatar May 21 '24 05:05 MilesCranmer

The issue is that pandas.DataFrame.__init__ explicitly checks if its argument is a dict and Py(::Dict) is not a dict (it's a juliacall.DictValue). The two options to make this work automatically are:

  • Change the PythonCall conversion rules to convert Julia Dict to Python dict. I'm not inclined to change this.
  • Change pandas.DataFrame.__init__ to check if the argument is a collections.abc.Mapping instead, which includes both dict and juliacall.DictValue.
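
To illustrate the difference between the two checks, here is a minimal pure-Python sketch. WrappedDict is a hypothetical stand-in for a wrapper type like juliacall.DictValue; it is not how juliacall implements it, and is_dict_strict / is_dict_like are made-up names rather than pandas functions.

```python
from collections.abc import Mapping


class WrappedDict(Mapping):
    """Hypothetical stand-in for a wrapper type such as juliacall.DictValue."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def is_dict_strict(obj):
    # The kind of check described above: only real dict instances pass.
    return isinstance(obj, dict)


def is_dict_like(obj):
    # The more general check: any Mapping passes, including dict.
    return isinstance(obj, Mapping)


wrapped = WrappedDict({"a": [1, 2, 3], "b": [4, 5, 6]})
print(is_dict_strict(wrapped))  # False
print(is_dict_like(wrapped))    # True
print(is_dict_like({"a": 1}))   # True
```

Calling dict(wrapped) on such an object yields a real dict, which is roughly what pydict does on the Julia side before the value reaches pandas.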

cjdoris avatar May 21 '24 11:05 cjdoris

I think requiring pylist to do the indexing is a similar issue: it checks for list rather than the more general collections.abc.Sequence, which includes both list and juliacall.VectorValue.
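
The same point for sequences, as a small sketch. WrappedVector is a hypothetical stand-in for a wrapper such as juliacall.VectorValue, not its real implementation.

```python
from collections.abc import Sequence


class WrappedVector(Sequence):
    """Hypothetical stand-in for a wrapper type such as juliacall.VectorValue."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


columns = WrappedVector(["a", "b"])

print(isinstance(columns, list))      # False: a list-only check rejects it
print(isinstance(columns, Sequence))  # True: the general check accepts it
print(list(columns))                  # ['a', 'b']: explicit conversion works
```

The explicit list(columns) conversion plays the same role here as pylist in the Julia workaround.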

cjdoris avatar May 21 '24 11:05 cjdoris

I think the solutions on the pandas side sound like better options to me. I'm not sure whether there are edge cases that prevent the checks from being more general... Like maybe some collections.abc.Sequence acting as a single key?
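
One way to see the kind of edge case I mean: strings and tuples are themselves Sequences, yet either could plausibly be meant as a single key. The helper names below are hypothetical and this is only a sketch of the ambiguity, not pandas's actual logic.

```python
from collections.abc import Sequence


def naive_is_key_list(key):
    # Naive rule: treat every Sequence as a list of keys.
    return isinstance(key, Sequence)


def guarded_is_key_list(key):
    # Assumed refinement: exclude types that commonly act as one key.
    return isinstance(key, Sequence) and not isinstance(key, (str, bytes, tuple))


print(naive_is_key_list("a"))           # True, although "a" is one column name
print(naive_is_key_list(("a", "b")))    # True, although a tuple can be one label
print(guarded_is_key_list("a"))         # False
print(guarded_is_key_list(("a", "b")))  # False
print(guarded_is_key_list(["a", "b"]))  # True
```

So a general Sequence check would need exclusions of this sort, and the right set of exclusions is a design decision for pandas.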

MilesCranmer avatar May 21 '24 16:05 MilesCranmer

cross-posted here: https://github.com/pandas-dev/pandas/issues/58803

MilesCranmer avatar May 21 '24 17:05 MilesCranmer