
Explicitly ordered Nominal and Ordinal Categorical Variables

Open has2k1 opened this issue 1 year ago • 5 comments

Motivation

This request is motivated by the desire to switch the internal dataframe representation in plotnine from pandas to polars.

The Problem

polars cannot create nominal or ordinal categoricals. If categoricals are not of these kinds, plotnine does not know:

  1. How to arrange a discrete variable along an axis
  2. How to arrange the keys in a legend
  3. How to arrange the panels of a facetted plot
  4. How to choose the appropriate colour scale for a categorical variable

Note that, even if nominal categorical variables have no inherent order, the order in which the categories are declared is useful.
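
As a rough illustration of why the declared order matters for points 1 to 3, the sketch below arranges category labels by a declared order using only the standard library. The function name arrange_by_declared_order and the sample data are hypothetical; none of this is polars or plotnine API.

```python
def arrange_by_declared_order(values, categories):
    """Return the distinct values, in the order the categories were declared."""
    present = set(values)
    return [c for c in categories if c in present]


# Hypothetical data: a nominal variable whose categories were declared
# in a deliberate order (for example, by the size of each category).
genres = ["Drama", "Action", "Comedy", "Drama"]
declared = ["Drama", "Comedy", "Action"]

# With a declared order, the axis, legend and panels can follow it.
axis_order = arrange_by_declared_order(genres, declared)

# Without one, a routine can only fall back on something like lexical order.
fallback_order = sorted(set(genres))

print(axis_order)
print(fallback_order)
```

The point of the sketch is only that the declared order carries information the values alone do not.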

Demonstration

We have a dataset, movies, with two columns: genre, which can be a nominal categorical, and rating, which can be an ordinal categorical.

Without this distinction between categoricals, we get a visualisation that looks like this.

a)

Image

compared to one that would look like this

b) Image

In a) the plotting (computation) system is denied information to make some meaningful presentation choices.

In b), the ordinal rating categorical on the x-axis is displayed in ascending order and, because we chose to colour the bars, an ordered colour scale was automatically chosen. Second, the nominal genre categorical is ordered based on the size of each category, and the plot panels are automatically arranged based on this order, which makes the pattern clear compared to a).

The demo code is available here

Generalisation

In our specific case we have

data -> plotting system -> graphic

other specific cases could be

data -> summarising function -> summary output
data -> tabulating function -> table

but the general pattern is

data -> computation -> results

Conclusion

Categorical variables should be nominal or ordinal and each should have an explicit ordering to allow routines that compute on them to make the best choices when presenting the results.

has2k1, Oct 01 '24 10:10

Hi @has2k1. Have you considered the Enum type? This type allows you to define the categories and the order yourself. It should be much preferred as this gives full control over the ordering.

In fact, we are trying to move away from the concept of ordering in Categoricals other than "lexical".

ritchie46, Oct 01 '24 12:10

Yes, I have seen Enum, but I'm not sure where the implementation is headed.

  1. Enum is a nominal categorical. Is it planned that ordinal categoricals will also be expressed through an Enum? E.g. given this data

    import polars as pl
    lst = ["mid", "low", "high", "mid"]
    en = pl.Series(lst, dtype=pl.Enum(("low", "mid", "high")))
    

    en can/should be an ordinal categorical.

  2. An Enum does not round trip well through pandas.

    en_rt = pl.from_pandas(en.to_pandas())
    

    en_rt is a Categorical type but it is equivalent to the enum.

    assert en.dtype.base_type() == pl.Enum  # True
    assert en_rt.dtype.base_type() == pl.Categorical  # True
    assert en.equals(en_rt)  # True !
    assert all(en.cat.get_categories() == en_rt.cat.get_categories())  # True !
    assert not en_rt.cat.uses_lexical_ordering()  # True  !
    

    That means en_rt is a Categorical with physical ordering, but the order (as returned by get_categories) is not physical! Though it isn't said in the documentation that it should be physical.

> In fact, we are trying to move away from the concept of ordering in Categoricals other than "lexical".

I think for categoricals, both "lexical" and "physical" can be convenient when the variable is being initialised; thereafter the order should be explicit. And there should be an option for an explicit ordering when initialising. If this is not the case, then there will be categoricals that can exist (after manipulations) but cannot be initialised into existence. For example

lst = list("ABAZ")
c = pl.Series(lst, dtype=pl.Categorical(ordering="lexical"))
c2 = c[:-1]
assert c2.to_list() == ["A", "B", "A"]
assert c2.cat.get_categories().to_list() == ["A", "B", "Z"]

You cannot create categorical c2 directly. The same holds if c had "physical" ordering. This means the initialisation API is lacking. But "physical" ordering has a certain oddness: when a dataframe with such a column is sorted by another column, the physical order of the column changes, but it maintains the hidden state of the original physical order.

data = pl.DataFrame({
    "v1": list("CAB"),
    "v2": pl.Series(list("CAB"), dtype=pl.Categorical(ordering="physical"))
})
data_sorted = data.sort(by="v1")
v2 = data_sorted["v2"]
assert v2.to_list() == ["A", "B", "C"]
assert not v2.cat.uses_lexical_ordering()
assert v2.cat.get_categories().to_list() == ["C", "A", "B"]

Again, you cannot instantiate a variable equivalent to v2.

has2k1, Oct 01 '24 17:10

> Enum is a nominal categorical. Is it planned that ordinal categoricals will also be expressed through an Enum? E.g. given this data

In an Enum the data is ordinal as defined in the categories.

lst = ["mid", "low", "high", "mid"]
en = pl.Series(lst, dtype=pl.Enum(("low", "mid", "high")))
en.sort()
shape: (4,)
Series: '' [enum]
[
	"low"
	"mid"
	"mid"
	"high"
]

And if you want a different ordering, you can reorder the categories:

lexical_enum = pl.Enum(sorted(en.dtype.categories))
en = en.cast(lexical_enum)
en.sort()

We could also add a nominal ordering, but I believe that would only mean we should raise an error if you try to sort them?

Going back from pandas.

Pandas doesn't have an Enum type, and thus it doesn't round trip completely.

However, we can still restore the Enum if we want.

# Assumption: en_pd is the pandas version of the Enum series en from above.
en_pd = en.to_pandas()
categories = en_pd.dtype.categories.to_list()
pl.from_pandas(en_pd).cast(pl.Enum(categories))

Physical ordering

The whole physical ordering is something that will be removed. It is an implementation detail, it will break in streaming/distributed workloads, and it is already flawed. We should order by something that is defined by the user, not by the accidental integers we assign to them.

> I think for categoricals, both "lexical" and "physical" can be convenient when the variable is being initialised; thereafter the order should be explicit. And there should be an option for an explicit ordering when initialising. If this is not the case, then there will be categoricals that can exist (after manipulations) but cannot be initialised into existence. For example

I don't entirely follow here. Could you explain what you want and what you cannot do?

In any case, with Enum everything is predictable and the user can define the order statically.

ritchie46, Oct 02 '24 09:10

> We could also add a nominal ordering, but I believe that would only mean we should raise an error if you try to sort them?

Sorting nominal categoricals should be permitted. While nominals have no inherent order, for presentation purposes it can be convenient to have them in some order. For example, it can be helpful to sort a dataframe by a nominal categorical column (plus zero or more other columns).

> I don't entirely follow here. Could you explain what you want and what you cannot do?

I was just pointing out a quirk of categorical values that can exist but cannot be declared as such in a single initialising expression.

Looking forward, if an Enum is an ordinal categorical, would a nominal categorical be represented by a different type? Can pl.Categorical be the nominal?

has2k1, Oct 04 '24 10:10

My understanding is that plotnine acts on two pieces of order information for categoricals/enums:

  1. level ordering: used to decide display order.
    • E.g. order of labels in a legend
    • E.g. order for bars plotted for a categorical
    • useful even when the categorical has no intrinsic order (e.g. order barchart based on height of bars)
  2. ordinal scale indicator: basically a True/False flag indicating that the levels have an intrinsic order
    • E.g. used to decide whether ColorBrewer's sequential (ordinal), or qualitative (nominal) palette makes more sense.

It sounds like enums capture (1) by maintaining level orderings, but there is no indicator for (2) whether it's believed to be on the ordinal scale. Without (2) plotting software can't decide things like what color palettes make most sense.

Information on (2) could always be passed in to plotting software separately, but I think oftentimes keeping it on the columns helps reduce spaghetti code (similar to keeping level ordering on columns) 😓. It also has the advantage of signaling that certain comparison operations are sensible (e.g. >).

Examples

import pandas as pd
from plotnine import ggplot, aes, geom_col

ratings = pd.DataFrame({
    "rating": ["bad", "average", "good"],
    "n": [2, 5, 3]
})

# (1) level ordering ----
ratings["rating_ord"] = pd.Categorical(ratings["rating"], categories=["bad", "average", "good"])

# (2) ordinal scale indicator ----
ratings["rating_ord_ind"] = pd.Categorical(ratings["rating"], categories=["bad", "average", "good"], ordered=True)

Example of level ordering

# BAD: x-axis and legend order goes average, bad, good
ggplot(ratings, aes("rating", "n", fill="rating")) + geom_col()

# GOOD
ggplot(ratings, aes("rating_ord", "n", fill="rating_ord")) + geom_col()
[Images: the BAD plot (left) and the GOOD plot (right)]

Example of ordinal scale indicator

# GOOD: infer on ordinal scale and use sequential color palette
ggplot(ratings, aes("rating_ord_ind", "n", fill="rating_ord_ind")) + geom_col()
image

Reframing discussion

I think a key here is that there are two pieces of information used. The first is just about ordering, and less about whether the data is intrinsically ordinal. The second piece, missing from enums, tells plotnine whether we also believe there's an intrinsic order (so that things like a sequential palette make sense).
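
To make the two pieces concrete outside of any dataframe library, here is a minimal standard-library sketch. The Levels class and the functions display_order and palette_kind are hypothetical names for illustration only; they are not part of plotnine, pandas, or Polars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Levels:
    categories: tuple      # (1) level ordering: decides display order
    ordered: bool = False  # (2) ordinal scale indicator: intrinsic order?


def display_order(values, levels):
    """Sort values by the position of their level, not alphabetically."""
    rank = {c: i for i, c in enumerate(levels.categories)}
    return sorted(values, key=rank.__getitem__)


def palette_kind(levels):
    """Pick a palette family from the ordinal flag alone."""
    return "sequential" if levels.ordered else "qualitative"


nominal = Levels(("bad", "average", "good"))
ordinal = Levels(("bad", "average", "good"), ordered=True)

print(display_order(["good", "bad", "average"], nominal))
print(palette_kind(nominal), palette_kind(ordinal))
```

Both objects give the same display order; only the flag changes the palette choice, which is the part an Enum on its own does not express.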

machow, Oct 15 '24 19:10

Alright, per @ritchie46's suggestion, I prototyped using Enum (and called the prototype library catfact). However, I think the combination of Enum's dtype implementation and map_batches makes this impossible.

The issue is that an Enum's dtype includes all of its categories (e.g. Enum(["z", "x", "y"]) is a dtype). So if I have a function that returns an Enum whose categories or their order aren't known ahead of time, I can't tell Polars the return_dtype. It looks like catfact may have run afoul of a change in recent Polars versions, since its functions now raise a schema error :/.

See https://github.com/machow/catfact/issues/6

Any idea what might be a good move here? catfact works well over Pandas, but I'm not sure how to support Polars :(.

Python code to reproduce

import polars as pl
import catfact as fct

from catfact.polars.data import starwars

# ERROR: over polars expression (via map_batches): expected output type `Enum([...])`, got `Enum([...])`
starwars.with_columns(eye_color=fct.infreq(pl.col("eye_color")))

# OKAY: over polars series
fct.infreq(starwars["eye_color"])

# OKAY: over pandas series
fct.infreq(starwars.to_pandas()["eye_color"])

edit: I think it's because by default when inferring the dtype, the default value used is an empty Enum with the existing mapping: https://github.com/pola-rs/polars/blob/main/crates/polars-core/src/datatypes/any_value.rs#L202

machow, Nov 07 '25 21:11