vaex
[BUG-REPORT] DataFrame describe method fails for List elements
Following up on https://github.com/vaexio/vaex/issues/2087#issuecomment-1163799755, the vaex.dataframe.DataFrameLocal.describe() method does not work for list types. It seems those types are not included in map_arrow_to_numpy.
import pyarrow as pa
import vaex

data = {"A": [1], "B": pa.array([["a", "b", "c"]])}
df = vaex.from_dict(data)
df.describe()
Traceback (most recent call last):
File "Mac/python3.9/site-packages/vaex/array_types.py", line 327, in numpy_dtype_from_arrow_type
return map_arrow_to_numpy[arrow_type]
KeyError: ListType(list<item: int64>)
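The traceback comes down to a plain dictionary lookup that has no entry for nested types. Below is a minimal sketch of that pattern, with strings standing in for Arrow types. The table contents and the helper name numpy_dtype_or_none are hypothetical and are not vaex's actual code. It only illustrates how a fallback could let describe skip numeric statistics for list columns instead of raising.

```python
# Hypothetical stand-in for a type lookup table. The real vaex table maps
# pyarrow types to numpy dtypes and is not reproduced here.
ARROW_TO_NUMPY_SKETCH = {
    "int64": "int64",
    "double": "float64",
    "bool": "bool",
}


def numpy_dtype_or_none(arrow_type_name):
    """Return the mapped dtype name, or None when the type has no entry."""
    # dict.get avoids the KeyError that a bare [] lookup raises.
    return ARROW_TO_NUMPY_SKETCH.get(arrow_type_name)


for name in ["int64", "list<item: int64>"]:
    dtype = numpy_dtype_or_none(name)
    if dtype is None:
        print(f"{name}: no numpy dtype, skip numeric statistics")
    else:
        print(f"{name}: numpy dtype {dtype}")
```

Whether returning None or raising a clearer error is the better choice would depend on how the callers of the real lookup use the result.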
What do you expect to see as output btw? Just count/missing values, i.e. no statistics?
This is the behavior that I would expect.
>>> import pandas as pd
>>> data = {"A": [1], "B": [["a", "b", "c"]]}
>>> pd.DataFrame(data).describe()
         A
count  1.0
mean   1.0
std    NaN
min    1.0
25%    1.0
50%    1.0
75%    1.0
max    1.0
>>> import pandas as pd
>>> data = {"A": [[1, 2, 3]], "B": [["a", "b", "c"]]}
>>> pd.DataFrame(data).describe()
                A          B
count           1          1
unique          1          1
top     [1, 2, 3]  [a, b, c]
freq            1          1
>>> data = {"B": [["a", "b", "c"]]}
>>> pd.DataFrame(data).describe()
                B
count           1
unique          1
top     [a, b, c]
freq            1
This will add significant overhead to describe. If you look at what describe currently outputs, the "count" field is the only one we have in common.
I once had the idea to have describe take additional arguments, so a user can specify if they want to have the n_unique elements, and maybe as you suggest the most_frequent and freq, which by default would be disabled.
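A rough sketch of what such opt-in statistics could look like for a single column, in plain Python. The function name describe_values and its keyword arguments are hypothetical and are not part of vaex. Lists are converted to tuples because lists are not hashable and so cannot be counted directly.

```python
from collections import Counter


def describe_values(values, n_unique=False, most_frequent=False):
    """Summarize one column; the extra statistics are opt-in (hypothetical API)."""
    present = [v for v in values if v is not None]
    summary = {"count": len(present), "missing": len(values) - len(present)}
    if n_unique or most_frequent:
        # Shallow conversion only: nested lists inside a list stay unhashable.
        keys = [tuple(v) if isinstance(v, list) else v for v in present]
        counts = Counter(keys)
        if n_unique:
            summary["n_unique"] = len(counts)
        if most_frequent and counts:
            top, freq = counts.most_common(1)[0]
            summary["most_frequent"] = top
            summary["freq"] = freq
    return summary


column = [["a", "b", "c"], ["a", "b", "c"], ["d"], None]
print(describe_values(column))
print(describe_values(column, n_unique=True, most_frequent=True))
```

With the defaults only the cheap count and missing fields are computed, so the extra counting pass is paid for only when a caller asks for it.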
I would be happy with that, even outside the context of lists. It would take some time/effort, and I am not sure how popular describe is.
In any case, feel free to open a PR on this!