pandas icon indicating copy to clipboard operation
pandas copied to clipboard

ENH: generalize `__init__` on a `dict` to `abc.collections.Mapping` and `__getitem__` on a `list` to `abc.collections.Sequence`

Open MilesCranmer opened this issue 1 year ago • 6 comments

Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

I wish that DataFrame was more compatible with initializing from more general classes of dictionaries, and being indexed by more general classes of sequences.

Feature Description

Initialization

This is motivated by attempting to use Pandas from Julia via PythonCall.jl, where I found the strange behavior where:

using PythonCall

pd = pyimport("pandas")

df = pd.DataFrame(Dict([
    "a" => [1, 2, 3],
    "b" => [4, 5, 6]
]))

would result in the following dataframe:

julia> df
Python:
   0
0  b
1  a

As @cjdoris discovered, this this is due to the fact "pandas.DataFrame.__init__ explicitly checks if its argument is a dict and Py(::Dict) is not a dict (it becomes a juliacall.DictValue in Python)."

The way to fix this would be to have __init__ instead check for its input being an instance of abc.collections.Mapping instead, which includes both dict and abc.collections.Mapping.

Indexing

The second incompatibility here is that indexing using lists that are not exactly list does not work as expected. For example, if I try to index a pandas dataframe using a sequence of strings Julia, I get this error:

julia> df[["a", "b"]]
ERROR: Python: TypeError: Julia: MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.
Python stacktrace:
 [1] __call__
   @ ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:223
 [2] apply_if_callable
   @ pandas.core.common ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/common.py:384
 [3] __getitem__
   @ pandas.core.frame ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/frame.py:4065
Stacktrace:
 [1] pythrow()
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
 [2] errcheck
   @ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
 [3] pygetitem(x::Py, k::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:171
 [4] getindex(x::Py, i::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/Py.jl:292
 [5] top-level scope
   @ REPL[18]:1

As @cjdoris found, this is due to __getitem__ checking for list.

The solution would be to check for the more general abc.collections.Sequence, which includes both list and juliacall.VectorValue.

Alternative Solutions

Alternative solution 1

There is https://github.com/JuliaData/DataFrames.jl which implements a dataframe type from scratch, but pandas is more actively maintained, and I find it easier to use, so would like to call it from Julia.

Alternative solution 2

The current workaround is to use these operations as follows (thanks to @mrkn)

julia> df = pd.DataFrame(pydict(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])))
Python:
   b  a
0  4  1
1  5  2
2  6  3

where we explicitly convert to a Python dict object. However, this is not obvious, and there is no error when you avoid pydict - it's only from checking the contents do you see it resulted in an unexpected dataframe.

Similarly, for indexing:

julia> df[pylist(["a", "b"])]
Python:
   a  b
0  1  4
1  2  5
2  3  6

However, this is not immediately obvious, and is a bit verbose. My immediate thought is that passing a list-like object to pandas getitem should work for indexing, and passing a dict-like object should work for initialization. Perhaps this is subjective but I thought I'd see what others think.

Checking the more general class of types would fix this, and also be compatible with any other list-like or dict-like inputs.

Alternative solution 3

As brought up by @cjdoris in https://github.com/JuliaPy/PythonCall.jl/issues/501, one option is to change PythonCall so that it automatically convert Julia Dict to Python dict whenever passed to a Python object.

However, this would prevent people from mutating Julia dicts from within Python code, so would be a massive change in behavior. And converting Julia Vectors to Python lists automatically would also prevent mutation from Python, so is basically a non-option.

Additional Context

Discussed in https://github.com/JuliaPy/PythonCall.jl/issues/501

MilesCranmer avatar May 21 '24 17:05 MilesCranmer

I’m -0.5 on this. Internally we would just convert to dict/list anyway. I’d rather users do that where necessary and avoid the perf penalty in the cases we do supporr

jbrockmendel avatar May 22 '24 00:05 jbrockmendel

Ah, there would be a performance penalty? I would have thought it would just be changing an isinstance(i, list) to isinstance(i, Sequence)?

MilesCranmer avatar May 22 '24 02:05 MilesCranmer

Slightly unrelated, but have you tried the Julia pandas wrapper? I believe it automatically does the conversion from Julia Vector/Dict to a Python list/dict for you

Aloqeely avatar May 22 '24 03:05 Aloqeely

Thanks, sadly that one looks to use the older PyCall.jl instead of the newer PythonCall.jl, so not (yet) compatible.

It's not a big deal if not possible. I guess it's a bit of a sharp edge, especially as it doesn't throw an error, but for users who google this, the workaround does work. I just thought it seemed like it might be more duck-typey/pythonic if any dict-like input was acceptable for initializing from a dict (and similar for sequences) rather than only explicit dicts – I could imagine other cases where it might be useful to have this. But I understand this is totally subjective!

MilesCranmer avatar May 22 '24 06:05 MilesCranmer

@jbrockmendel I checked the performance degradation by accepting Mapping. The asv benchmark said BENCHMARKS NOT SIGNIFICANTLY CHANGED. See https://github.com/mrkn/pandas/pull/1 for more details and the patch I applied.

Note that I don't have much knowledge of the pandas internals so this patch can be insufficient to accept Mapping to create a dataframe.

mrkn avatar May 22 '24 07:05 mrkn

I’m -0.5 on this. Internally we would just convert to dict/list anyway. I’d rather users do that where necessary and avoid the perf penalty in the cases we do supporr

In the case of dict -> Mapping this isn't true - the dataframe constructor calls dict_to_mgr(data, ...) (https://github.com/pandas-dev/pandas/blob/2aa155ae1cf019e902441463d1af873eaa3c803b/pandas/core/frame.py#L763) which internally just calls data.keys() and data[key] for each key and converts them to some other internal data structure - so it doesn't matter whether the source is a dict or any other Mapping.

AFAICT the only performance concern is in the actual type check isinstance(data, dict) -> isinstance(data, Mapping) which I presume is negligible (and backed up by mrkn's post).

I haven't looked at the indexing code to see if the same conclusions hold for list -> Sequence there.

cjdoris avatar May 22 '24 16:05 cjdoris

I made a pull-request to propose introduce Mapping support in DataFrame construction https://github.com/pandas-dev/pandas/pull/58814.

mrkn avatar May 23 '24 06:05 mrkn

#58814 only fixes the issue of constructing a DataFrame from a Mapping, but using a Mapping can still work unexpectedly for other methods, take this example from the docs (using #58814's DictWrapper class):

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]}, index=["falcon", "dog"])

values = {"num_wings": [0, 3]}
my_dict = DictWrapper(values)  # <-- Mapping

print(df.isin(values))  # Correct result
print(df.isin(my_dict))  # Wrong result

A quick search shows 100+ results of isinstance(_, dict) checks and also 100+ isinstance(_, list) checks, so I think if we're going to support Mapping for DataFrame construction then we'd have to support it anywhere else.

Aloqeely avatar May 23 '24 10:05 Aloqeely

@MilesCranmer Regarding __getitem__, the error in this method is not due to checking for list. The real cause is that juliacall.VectorValue is callable. The traceback indicating that the error occurred in com.apply_if_callable is evidence of this.

This means we cannot support juliacall.VectorValue simply by accepting Sequence in addition to list.

mrkn avatar May 27 '24 06:05 mrkn

I guess the following change, skipping the call to the object if an error occurs, could be acceptable for improving pandas's interoperability with other libraries.

diff --git a/pandas/core/common.py b/pandas/core/common.py
index 9629199122..31442d9f40 100644
--- a/pandas/core/common.py
+++ b/pandas/core/common.py
@@ -385,7 +385,10 @@ def apply_if_callable(maybe_callable, obj, **kwargs):
     **kwargs
     """
     if callable(maybe_callable):
-        return maybe_callable(obj, **kwargs)
+        try:
+            return maybe_callable(obj, **kwargs)
+        except Exception:
+            pass

     return maybe_callable

After passing com.apply_if_callable, the key seems to be checked by the is_list_like function, and it accepts juliacall.VectorValue:

julia> using PythonCall

julia> pd = pyimport("pandas")
Python: <module 'pandas' from '/home/mrkn/.pyenv/versions/3.11.7/lib/python3.11/site-packages/pandas/__init__.py'>

julia> pd._libs.lib.is_list_like(["a", "b"])
Python: True

mrkn avatar May 27 '24 07:05 mrkn

so I think if we're going to support Mapping for DataFrame construction then we'd have to support it anywhere else.

Definitely -1 on this. The request adds a lot of maintenance burden for a use case that isn't remotely supported.

jbrockmendel avatar May 31 '24 17:05 jbrockmendel

The request adds a lot of maintenance burden for a use case that isn't remotely supported.

I'm -1 on this entirely, the point I wanted to emphasize is that allowing Mapping in the DataFrame constructor does not fully solve this problem, so why bother fixing some parts only?

Aloqeely avatar May 31 '24 17:05 Aloqeely

then we'd have to support it anywhere else

I think the easiest approach would just be to convert it to a dict or list internally, so you would not need to do those checks.

I would of course not expect you to support Mapping throughout the entire library, my apologies if that is what it came off like.

The request adds a lot of maintenance burden

I think if you simply convert incoming types to a dict if they are Mapping-like, then it should be fairly low-cost, and would improve Pandas extensibility to other libraries that inherit from Python's Mapping abstract type, which defines what interfaces they support: https://docs.python.org/3/library/collections.abc.html.

a use case that isn't remotely supported.

While I of course don't expect support for my use-case, I just want to gently suggest that this sort of change seems like the "right thing to do" at some level. The fact Julia is running into errors (despite implementing the abstract class correctly according to the Python docs) is moreso a signal of a general interface mismatch, rather than a specific library to support.

For example, if an incoming object inherits from Python's abstract Mapping class, then pandas __init__ may want to treat it like a mapping, rather than the current behavior seen below. Only checking dict means you end up treating dict-like objects as if they were something else:

import pandas as pd
from collections.abc import Mapping

class MyMapping(Mapping):
    def __init__(self, **kwargs):
        self.d = dict(**kwargs)
    def __getitem__(self, k):
        return self.d[k]
    def __iter__(self):
        return iter(self.d)
    def __len__(self):
        return len(self.d)

d = MyMapping(a=[1, 2], b=[3, 4])

df = pd.DataFrame(d)

print(df)

This results in:

   0
0  a
1  b

which is unexpected.

The Python docs describe this class as:

A container object that supports arbitrary key lookups and implements the methods specified in the collections.abc.Mapping or collections.abc.MutableMapping abstract base classes. Examples include dict, collections.defaultdict, collections.OrderedDict and collections.Counter.

So I think it's reasonable to check for this. Or at the very least throw an error instead of producing unexpected behavior like seen above.

Of course as the maintainers the decision is up to you.

MilesCranmer avatar May 31 '24 18:05 MilesCranmer

Looks like there was also thread a while ago on this: https://github.com/pandas-dev/pandas/issues/6792 (thanks for linking @mroeschke)

MilesCranmer avatar Jun 01 '24 00:06 MilesCranmer