altair
[ENH] Add Support for pd.Index, pd.Series, range to array type
This PR tries to address #2808 and #2877 by adding
for k, v in list(kwds.items()):
    if k not in (list(ignore) + ["shorthand"]):
        if isinstance(v, (pd.Series, pd.Index)):
            kwds[k] = v.to_list()
        elif isinstance(v, range):
            kwds[k] = list(v)
    else:
        kwds.pop(k, None)
to convert the value to an array type (a list) when it is a pd.Series, pd.Index, or range. I am not sure whether converting inside to_dict() is the right approach, or whether it should happen earlier (maybe in _todict?).
It works on some of the test cases mentioned:
import altair as alt
from vega_datasets import data
import random
import pandas as pd
import numpy as np
n = 100
df = pd.DataFrame({"Category": [np.random.choice(["A","B","C","D"]) for i in range(n)],
                   "Variable": [np.random.normal(0, 10) for i in range(n)]})
grouped = df.loc[:,['Category', 'Variable']] \
    .groupby(['Category']) \
    .median() \
    .sort_values(by='Variable').index
chart = alt.Chart(df).mark_boxplot().encode(
    x=alt.X("Category",sort=grouped),
    y='Variable'
)
chart
import altair as alt
import pandas as pd
from vega_datasets import data
barley = data.barley()
barley['variety'] = pd.Categorical(
    barley['variety'],
    ordered=True,
    categories=[
        'Manchuria',
        'No. 457',
        'No. 462',
        'No. 475',
        'Glabron',
        'Svansota',
        'Velvet',
        'Trebi',
        'Wisconsin No. 38',
        'Peatland'
    ]
)
chart = alt.Chart(barley).mark_bar().encode(
    x=alt.X('variety', sort=barley['variety'].cat.categories),  # This line needs manual conversion
    y=alt.Y('sum(yield)'),
    color='site:N'
)
chart
Thanks for working on this @ChiaLingWeng! And for all your contributions recently! This will be a convenient addition.
I am not sure where the best place for this logic would be, and I don't have time to dive deeper myself right now, but others might have thoughts on this.
Regarding the logic, I wonder if we can make it more general to support any iterable (anything that can be converted into a list). This would allow us to support other dataframe libraries too, as well as tuples, numpy arrays, etc., without special tests for each. I think every value that can be passed to altair is either a string (column names), a dictionary (alt.value, alt.condition, etc.), or a class (alt.Color, alt.X, etc.). So maybe we can just do something like:
if not isinstance(v, (str, dict, type)):  # type is for class
    list(v)  # this works with any iterable, which includes series, indexes, arrays, etc
This throws an informative error too pointing out that the object is not an iterable. There might be some grave oversight here because I haven't tried things out, but the general idea of making this solution more general would be helpful I think.
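To illustrate the point about `list()`, here is a minimal standard-library sketch. The helper name `as_list` is hypothetical and not part of altair; it only shows that any iterable converts the same way and that a non-iterable surfaces Python's own `TypeError`.

```python
def as_list(v):
    """Convert any iterable to a list; non-iterables raise TypeError."""
    return list(v)


# Ranges, tuples and generators all convert the same way.
from_range = as_list(range(3))
from_tuple = as_list(("a", "b"))
from_generator = as_list(x * 2 for x in range(3))

# A non-iterable such as an int fails with Python's built-in error.
try:
    as_list(5)
except TypeError as err:
    error_message = str(err)  # mentions that the object is not iterable
```

pandas Series and Index objects are iterable as well, so the same call would cover them without a dedicated `isinstance` branch.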
@joelostblom Thanks for your suggestion! I will look into whether there is a more general change for this.
Would
elif hasattr(v, '__iter__'):
    kwds[k] = list(v)
be sufficient? Or is this too loose?
list takes any iterable: https://docs.python.org/3/library/stdtypes.html#list so I think we could do what @mattijn suggested. Instead of the hasattr check, it's more idiomatic to use the typing.Iterable protocol in an isinstance check:
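A minimal sketch of how the two checks compare, assuming `collections.abc.Iterable` (the class that `typing.Iterable` aliases). The helper names `has_iter` and `is_iterable` are made up for illustration.

```python
from collections.abc import Iterable


def has_iter(v):
    """Duck-typing check: does the object define __iter__?"""
    return hasattr(v, "__iter__")


def is_iterable(v):
    """Protocol check: does the object satisfy the Iterable ABC?"""
    return isinstance(v, Iterable)


samples = [range(3), ("a", "b"), [1, 2], {"k": 1}, "abc", 5]

# For these built-in types the two checks give identical answers.
results = [(has_iter(v), is_iterable(v)) for v in samples]
```

Both checks accept `"abc"` and reject `5`, so neither distinguishes a string from a list on its own.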
The issue with both of these approaches would be that they catch strings as well, which is why I went for explicitly stating the types not to iterate. But maybe checking if the object is an iterable and not a string would work well?
Good points @joelostblom and @binste! My first thought was that it was OK to have a str count as Iterable, but since altair accepts different types of input for a single argument, this can likely lead to interference. Example:
import altair as alt
import pandas as pd
data = {'x': ['a', 'b', 'c', 'd', 'e', 'f'], 'y': [5, 3, 8, 4, 6, 2]}
df = pd.DataFrame(data)
alt.Chart(df).mark_bar().encode(
    x=alt.X('x', sort=list('cdfabe')),  # `sort='cdfabe'` can become ['c', 'd', 'f', 'a', 'b', 'e']
    y='y'
)
alt.Chart(df).mark_bar().encode(
    x=alt.X('x', sort='y'),  # `sort='y'` should not become ['y']
    y='y'
)
Probably something you already realised, @joelostblom.
Would it be a good idea to have one place for our own input types, permissive but constrained? There we could define types such as ListLike (but not str) and maybe others such as DataFrameLike, which could be used internally and also exposed as part of the public API.
Yes, exactly, that's the reason we need to avoid catching strings in this check. I would be in favor of developing our own types if there is a clear benefit over writing something like not isinstance(v, str) and isinstance(v, Iterable). I know numpy has an ArrayLike type, but that includes strings and can't be used with isinstance in general (not sure if that would be the same if we developed our own ListLike type). For DataFrameLike, I thought what we are doing currently was enough: checking for the dataframe exchange protocol dunder string? Unless you mean for the typing hints specifically, in which case I see the convenience of having ListLike (or ArrayLike) and DataFrameLike.
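On the open question of whether a custom ListLike type would behave the same way, here is one possible sketch using a runtime-checkable `typing.Protocol`. The `ListLike` class and `is_list_like` helper are hypothetical, not existing altair names, and the choice of `__iter__` plus `__len__` as the defining methods is an assumption.

```python
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class ListLike(Protocol):
    """Hypothetical structural type: anything sized and iterable."""

    def __iter__(self) -> Iterator: ...

    def __len__(self) -> int: ...


def is_list_like(v):
    """Runtime check that excludes text and bytes explicitly."""
    return isinstance(v, ListLike) and not isinstance(v, (str, bytes))


# The protocol alone cannot tell a string from a list of characters.
protocol_matches_str = isinstance("abc", ListLike)
```

Under these assumptions the protocol works with `isinstance`, but it still matches strings (and dicts), so the explicit exclusion would be needed at runtime either way. The main gain would be for type hints.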
Not sure if this is strict enough, but from this discussion, I found pd.api.types.is_list_like() may be useful.
Thanks @ChiaLingWeng! That seems to be the behaviour we are after here as well, but then without depending on pandas itself. Maybe we can use the same implementation as adopted within pandas, but to me it's not really obvious how this function is implemented (I can only find this).
@joelostblom, yes, I was thinking of defining these as public type definitions that can also be used by users prior to pushing their data into Altair.
Nice find @ChiaLingWeng! I agree with @mattijn that if we could re-implement this without having to import pandas, that would be great (since it would help us make pandas an optional dependency for altair in the future).
I believe the actual pandas implementation is here because it is a Cython function. It seems to me that they use a similar approach to what we discussed here; essentially checking if it is an Iterable and not a string, so something like:
from typing import Iterable
# A tuple of types also works on Python versions older than 3.10
if not isinstance(v, (str, dict, type, bytes)) and isinstance(v, Iterable):  # type is for class
    v = list(v)
We could also do a try/except with list(v) right away instead of the explicit Iterable check, but maybe it is clearer with the explicit check as per the above. I don't know if we need any additional error handling; I think implementing this and then running the test suite to see if something breaks would be the next step.
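The two options can be compared side by side in a small standalone sketch. The names `PASS_THROUGH`, `to_list_explicit` and `to_list_try` are hypothetical and only illustrate the idea; this is not the code that ended up in altair.

```python
from collections.abc import Iterable

# Types the discussion proposes to leave untouched.
PASS_THROUGH = (str, bytes, dict, type)


def to_list_explicit(v):
    """Explicit check: convert only non-excluded iterables."""
    if not isinstance(v, PASS_THROUGH) and isinstance(v, Iterable):
        return list(v)
    return v


def to_list_try(v):
    """try/except variant: attempt list(v) and fall back on TypeError."""
    if isinstance(v, PASS_THROUGH):
        return v
    try:
        return list(v)
    except TypeError:
        return v


converted = to_list_explicit(range(3))  # becomes a list
column_name = to_list_explicit("y")  # stays a string
```

One difference worth noting: the try/except variant also converts objects that only implement `__getitem__`, which `list()` can iterate but which `isinstance(v, Iterable)` does not recognize.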