Explicitly ordered Nominal and Ordinal Categorical Variables
Motivation
This request is motivated by the desire to switch the internal dataframe representation in plotnine from pandas to polars.
The Problem
polars cannot create nominal or ordinal categoricals. If categoricals are not of these kinds, plotnine does not know:
- How to arrange a discrete variable along an axis
- How to arrange the keys in a legend
- How to arrange the panels of a facetted plot
- How to choose the appropriate colour scale for a categorical variable
Note that, even if nominal categorical variables have no inherent order, the order in which the categories are declared is useful.
Demonstration
We have a dataset, movies, with two columns: genre, which can be a nominal categorical, and rating, which can be an ordinal categorical.
Without this distinction between categoricals, we get a visualisation that looks like this.
a)
compared to one that would look like this
b)
In a) the plotting (computation) system is denied information to make some meaningful presentation choices.
In b), two things differ. First, the ordinal rating categorical on the x-axis is displayed in ascending order and, because we chose to colour the bars, an ordered colour scale was automatically chosen. Second, the nominal genre categorical is ordered based on the size of each category, and the plot panels are automatically arranged based on this order, which makes the pattern clearer than in a).
The demo code is available here
Generalisation
In our specific case we have
data -> plotting system -> graphic
other specific cases could be
data -> summarising function -> summary output
data -> tabulating function -> table
but the general pattern is
data -> computation -> results
Conclusion
Categorical variables should be nominal or ordinal and each should have an explicit ordering to allow routines that compute on them to make the best choices when presenting the results.
Hi @has2k1. Have you considered the Enum type? This type allows you to define the categories and the order yourself. It should be much preferred as this gives full control over the ordering.
In fact, we are trying to move away from the concept of ordering in Categoricals other than "lexical".
Yes I have seen Enum, but I'm not sure where the implementation is headed.
- Enum is a nominal categorical. Is it planned that ordinal categoricals will also be expressed through an Enum as well. e.g. given this data
import polars as pl
lst = ["mid", "low", "high", "mid"]
en = pl.Series(lst, dtype=pl.Enum(("low", "mid", "high")))
en can/should be ordinal categorical.
- An Enum does not round trip well through pandas.
en_rt = pl.from_pandas(en.to_pandas())
en_rt is a Categorical type but it is equivalent to the enum.
assert en.dtype.base_type() == pl.Enum # True
assert en_rt.dtype.base_type() == pl.Categorical # True
assert en.equals(en_rt) # True !
assert all(en.cat.get_categories() == en_rt.cat.get_categories()) # True !
assert not en_rt.cat.uses_lexical_ordering() # True !
That means en_rt is a Categorical with physical ordering, but the order (as returned by get_categories) is not physical! Though it isn't said in the documentation that it should be physical.
In fact, we are trying to move away from the concept of ordering in Categoricals other than "lexical".
I think for categoricals, both "lexical" and "physical" can be convenient when the variable is being initialised; thereafter the order should be explicit. And there should be an option for an explicit ordering when initialising. If this is not the case then there will be categoricals that can exist (after manipulations), but cannot be initialised into existence. For example
lst = list("ABAZ")
c = pl.Series(lst, dtype=pl.Categorical(ordering="lexical"))
c2 = c[:-1]
assert c2.to_list() == ["A", "B", "A"]
assert c2.cat.get_categories().to_list() == ["A", "B", "Z"]
You cannot create the categorical c2 directly. The same holds if c had "physical" ordering. This means the initialisation API is lacking. But "physical" ordering has a certain oddness: when a dataframe with such a column is sorted by another column, the physical order of the column changes, but it retains the hidden state of the original physical order.
data = pl.DataFrame({
"v1": list("CAB"),
"v2": pl.Series(list("CAB"), dtype=pl.Categorical(ordering="physical"))
})
data_sorted = data.sort(by="v1")
v2 = data_sorted["v2"]
assert v2.to_list() == ["A", "B", "C"]
assert not v2.cat.uses_lexical_ordering()
assert v2.cat.get_categories().to_list() == ["C", "A", "B"]
Again, you cannot instantiate a variable equivalent to v2.
Enum is a nominal categorical. Is it planned that ordinal categoricals will also be expressed through an Enum as well. e.g. given this data
In an Enum the data is ordinal as defined in the categories.
lst = ["mid", "low", "high", "mid"]
en = pl.Series(lst, dtype=pl.Enum(("low", "mid", "high")))
en.sort()
shape: (4,)
Series: '' [enum]
[
"low"
"mid"
"mid"
"high"
]
And if you want a different ordering, you can reorder the categories:
lexical_enum = pl.Enum(sorted(en.dtype.categories))
en = en.cast(lexical_enum)
en.sort()
We could also add a nominal ordering, but I believe that would only mean we should raise an error if you try to sort them?
Going back from pandas.
Pandas doesn't have an Enum type, and thus it doesn't round trip completely.
However, we can still restore the Enum if we want.
en_pd = en.to_pandas()
categories = en_pd.dtype.categories.to_list()
pl.from_pandas(en_pd).cast(pl.Enum(categories))
Physical ordering
The whole physical ordering is something that will be removed. It is an implementation detail that will break in streaming/distributed workloads and is already flawed. We should order by something that is defined by the user, not by the accidental integers we assign to them.
I think for categoricals, both "lexical" and "physical" can be convenient when the variable is being initialised; thereafter the order should be explicit. And there should be an option for an explicit ordering when initialising. If this is not the case then there will be categoricals that can exist (after manipulations), but cannot be initialised into existence. For example
I don't entirely follow here. Could you explain what you want and what you cannot do?
In any case, with Enum everything is predictable and the user can define the order statically.
We could also add a nominal ordering, but I believe that would only mean we should raise an error if you try to sort them?
Sorting nominal categoricals should be permitted. While nominals have no inherent order, for presentation purposes it can be convenient to have them in some order. For example, it can be helpful to sort a dataframe by a nominal categorical column (plus zero or more other columns).
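As a rough illustration of sorting by a nominal column using its declared level order, here is a minimal plain-Python sketch. The data, the level order, and all names are hypothetical and chosen only for illustration; this is not how polars or plotnine implements sorting.

```python
# Hypothetical rows of (genre, title); the values are illustrative only.
genre_levels = ["drama", "comedy", "action"]  # declared order, not alphabetical
rows = [
    ("action", "B"),
    ("drama", "C"),
    ("comedy", "A"),
    ("drama", "A"),
]

# Map each level to its position in the declared order.
rank = {level: i for i, level in enumerate(genre_levels)}

# Sort by the nominal column first (declared order), then by title.
rows_sorted = sorted(rows, key=lambda r: (rank[r[0]], r[1]))
```

The nominal column has no intrinsic order, yet the declared order still gives a stable, meaningful arrangement for presentation.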
I don't entirely follow here. Could you explain what you want and what you cannot do?
I was just pointing out a quirk of categorical values that can exist but cannot be declared as such in a single initialising expression.
Looking forward, if an Enum is an ordinal categorical, would a nominal categorical be represented by a different type? Can pl.Categorical be the nominal?
My understanding is that plotnine acts on two pieces of order information for categoricals/enums:
1. level ordering: used to decide display order.
  - E.g. order of labels in a legend
  - E.g. order for bars plotted for a categorical
  - useful even when the categorical has no intrinsic order (e.g. order barchart based on height of bars)
2. ordinal scale indicator: basically a True/False flag indicating that the levels have an intrinsic order
  - E.g. used to decide whether ColorBrewer's sequential (ordinal), or qualitative (nominal) palette makes more sense.
It sounds like enums capture (1) by maintaining level orderings, but there is no indicator for (2) whether it's believed to be on the ordinal scale. Without (2) plotting software can't decide things like what color palettes make most sense.
Information on (2) could always be passed in to plotting software separately, but I think oftentimes keeping it on the columns helps reduce spaghetti code (similar to keeping level ordering on columns) 😓. It also has the advantage of signaling that certain comparison operations are sensible (e.g. >).
Examples
import pandas as pd
from plotnine import ggplot, aes, geom_col
ratings = pd.DataFrame({
"rating": ["bad", "average", "good"],
"n": [2, 5, 3]
})
# (1) level ordering ----
ratings["rating_ord"] = pd.Categorical(ratings["rating"], categories=["bad", "average", "good"])
# (2) ordinal scale indicator ----
ratings["rating_ord_ind"] = pd.Categorical(ratings["rating"], categories=["bad", "average", "good"], ordered=True)
Example of level ordering
# BAD: x-axis and legend order goes average, bad, good
ggplot(ratings, aes("rating", "n", fill="rating")) + geom_col()
# GOOD
ggplot(ratings, aes("rating_ord", "n", fill="rating_ord")) + geom_col()
[Figures: the BAD plot (left) and the GOOD plot (right)]
Example of ordinal scale indicator
# GOOD: infer on ordinal scale and use sequential color palette
ggplot(ratings, aes("rating_ord_ind", "n", fill="rating_ord_ind")) + geom_col()
Reframing discussion
I think a key point here is that two pieces of information are used. The first is just about ordering, and less about whether the data is intrinsically ordinal. The second piece, missing from enums, tells plotnine whether we also believe there's an intrinsic order (so that things like a sequential palette make sense).
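To make the distinction concrete, here is a minimal plain-Python sketch of how a consumer could use the two pieces of information separately. The Levels class and both functions are hypothetical names invented for this illustration; they are not part of plotnine, pandas, or polars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Levels:
    """Hypothetical container for the two pieces of information."""
    order: tuple   # (1) level ordering, used for display order
    ordinal: bool  # (2) ordinal scale indicator


def display_order(values, levels):
    """Return the levels that occur in `values`, in declared order."""
    present = set(values)
    return [lv for lv in levels.order if lv in present]


def palette_kind(levels):
    """Pick a palette family from the ordinal flag alone."""
    return "sequential" if levels.ordinal else "qualitative"


rating = Levels(order=("bad", "average", "good"), ordinal=True)
genre = Levels(order=("drama", "comedy", "action"), ordinal=False)

rating_axis = display_order(["good", "bad", "good"], rating)
genre_axis = display_order(["action", "drama"], genre)
```

Both variables get a well-defined display order from (1), but only the flag in (2) lets palette_kind tell them apart.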
Alright, per @ritchie46's suggestion, I prototyped using Enum (and called the prototype library catfact). However, I think the way Enum's dtype is implemented, combined with map_batches, makes this impossible.
The issue is that an Enum's dtype includes all of its categories (e.g. Enum(["z", "x", "y"]) is a dtype). So if I have a function that returns an Enum whose categories or their order aren't known ahead of time, I can't tell Polars the return_dtype. It looks like catfact may have run afoul of something in recent Polars versions, since its functions now raise a schema error :/.
See https://github.com/machow/catfact/issues/6
Any idea what might be a good move here? catfact works well over Pandas, but I'm not sure how to support Polars :(.
Python code to reproduce
import polars as pl
import catfact as fct
from catfact.polars.data import starwars
# ERROR: over polars expression (via map_batches): expected output type `Enum([...])`, got `Enum([...])`
starwars.with_columns(eye_color=fct.infreq(pl.col("eye_color")))
# OKAY: over polars series
fct.infreq(starwars["eye_color"])
# OKAY: over pandas series
fct.infreq(starwars.to_pandas()["eye_color"])
edit: I think it's because by default when inferring the dtype, the default value used is an empty Enum with the existing mapping: https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/any_value.rs#L202