Pandas compatibility
Affects: PythonCall
Describe the bug
I have been trying to use pandas from PythonCall.jl and wanted to document a few calls that do not translate directly to Julia. This might just mean we need a PythonPandas package to translate calls, but I wonder if there are any missing methods that could be implemented to fix things automatically.
First, the preamble for this:
using PythonCall
pd = pyimport("pandas")
- [ ] 1. Constructing pandas.DataFrame:
Using a similar syntax to Python:
df = pd.DataFrame(Dict([
"a" => [1, 2, 3],
"b" => [4, 5, 6]
]))
which results in the following dataframe:
julia> df
Python:
0
0 b
1 a
i.e., it seems to have a single column named "0" and rows for a and b.
If I instead write this as a vector of pairs, I get:
julia> pd.DataFrame([
"a" => [1, 2, 3],
"b" => [4, 5, 6]
])
Python:
0 1
0 a [1, 2, 3]
1 b [4, 5, 6]
I suppose this one makes sense.
I was able to get it working with the following syntax instead:
julia> df = pd.DataFrame([
1 4
2 5
3 6
], columns=["a", "b"])
Python:
a b
0 1 4
1 2 5
2 3 6
- [ ] 2. Selecting multiple columns
So, selecting a single column works:
julia> df["a"]
Python:
0 1
1 2
2 3
Name: a, dtype: int64
but selecting multiple columns does not:
julia> df[["a", "b"]]
ERROR: Python: TypeError: Julia: MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.
Python stacktrace:
[1] __call__
@ ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:223
[2] apply_if_callable
@ pandas.core.common ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/common.py:384
[3] __getitem__
@ pandas.core.frame ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/frame.py:4065
Stacktrace:
[1] pythrow()
@ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
[2] errcheck
@ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
[3] pygetitem(x::Py, k::Vector{String})
@ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:171
[4] getindex(x::Py, i::Vector{String})
@ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/Py.jl:292
[5] top-level scope
@ REPL[18]:1
I got around this by inserting a pylist call:
julia> df[pylist(["a", "b"])]
Python:
a b
0 1 4
1 2 5
2 3 6
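The traceback in item 2 passes through pandas' apply_if_callable and then the wrapper's __call__, which suggests the wrapped Vector is treated as a callable key. Here is a minimal pure-Python sketch of that failure mode. CallableVector and apply_if_callable_sketch are hypothetical stand-ins written for illustration; they are not the actual juliacall or pandas code.

```python
class CallableVector:
    """Hypothetical stand-in for a wrapped Julia Vector: indexable, but it also defines __call__."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        return self._items[i]

    def __call__(self, *args, **kwargs):
        # The underlying Julia value is not callable, so forwarding the call fails.
        raise TypeError("objects of type Vector{String} are not callable")


def apply_if_callable_sketch(key, obj):
    # Simplified stand-in for the step named in the traceback:
    # if the key looks callable, call it with the object; otherwise return it unchanged.
    if callable(key):
        return key(obj)
    return key


key = CallableVector(["a", "b"])
print(callable(key))         # True: the class defines __call__
print(callable(["a", "b"]))  # False: a plain list is passed through untouched
```

Under this assumption, a plain Python list never reaches the call step, which is consistent with the pylist workaround succeeding.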
As described in the PythonCall documentation, AbstractArray and AbstractDict values are implicitly converted to wrapper objects when they are passed to a Python call.
In the first case, you should use the pydict function to convert a Julia Dict to a Python dict.
julia> df = pd.DataFrame(pydict(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])))
Python:
b a
0 4 1
1 5 2
2 6 3
As in the first case, an explicit call to the pylist function is required in the second case.
Thanks, that makes sense! I didn’t see pydict.
So should this be closed or is there anything that can be done automatically?
The issue is that pandas.DataFrame.__init__ explicitly checks if its argument is a dict and Py(::Dict) is not a dict (it's a juliacall.DictValue). The two options to make this work automatically are:
- Change the PythonCall conversion rules to convert Julia Dict to Python dict. I'm not inclined to change this.
- Change pandas.DataFrame.__init__ to check if the argument is a collections.abc.Mapping instead, which includes both dict and juliacall.DictValue.
I think requiring pylist to do the indexing is a similar issue - it checks for list rather than the more general collections.abc.Sequence, which includes both list and juliacall.VectorValue.
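To illustrate the distinction between the concrete types and the abstract ones, here is a small standard-library sketch. WrappedDict and WrappedVector are hypothetical stand-ins for juliacall.DictValue and juliacall.VectorValue; the real wrappers are not used here, and the abstract base classes come from the collections.abc module.

```python
from collections.abc import Mapping, Sequence


class WrappedDict(Mapping):
    """Hypothetical stand-in for a wrapper type: a Mapping that is not a dict."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


class WrappedVector(Sequence):
    """Hypothetical stand-in for a wrapper type: a Sequence that is not a list."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, i):
        return self._items[i]

    def __len__(self):
        return len(self._items)


d = WrappedDict({"a": [1, 2, 3], "b": [4, 5, 6]})
v = WrappedVector(["a", "b"])

print(isinstance(d, dict), isinstance(d, Mapping))   # False True
print(isinstance(v, list), isinstance(v, Sequence))  # False True

# Explicit conversion gives the concrete types that a strict isinstance check accepts.
print(isinstance(dict(d), dict), isinstance(list(v), list))  # True True
```

A check written against dict or list rejects both wrappers, while a check against Mapping or Sequence accepts them; the explicit dict(...) and list(...) conversions play the same role here that pydict and pylist play on the Julia side.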
I think the solutions on the pandas side sound like the better options to me. I'm not sure if there are edge cases which prevent the checks from being more general... Like maybe some collections.abc.Sequence acting as a single key?
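One way to see where such an edge case could come from: in the Python standard library, str and tuple are Sequences too, and my understanding is that pandas treats a string (and, as far as I know, a tuple) as a single column label rather than a list of labels. The sketch below only shows the standard-library facts; how pandas would need to handle them is an open question, not something this code settles.

```python
from collections.abc import Sequence

candidates = {
    "list": ["a", "b"],    # today: selects multiple columns
    "tuple": ("a", "b"),   # a Sequence, but hashable and usable as one label
    "str": "ab",           # a Sequence of characters, but clearly one label
}

# For each candidate key, record (name, is it a list?, is it a Sequence?).
rows = [
    (name, isinstance(key, list), isinstance(key, Sequence))
    for name, key in candidates.items()
]

for name, is_list, is_sequence in rows:
    print(f"{name:5s} list={is_list!s:5s} Sequence={is_sequence}")
```

A check that simply swapped list for Sequence would also match the tuple and the string, so it would need extra exclusions to keep single-label lookups working.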
cross-posted here: https://github.com/pandas-dev/pandas/issues/58803