ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: first-class support for enum datatype

Open NickCrews opened this issue 8 months ago • 2 comments

Is your feature request related to a problem?

This would be quite a large change. I understand if you just want to squash teh whole idea.

In both my postgres database and my motherduck databases that I am working with, I have some columns as enum datatypes. This helps with both performance and data integrity.

However, when I am using these database with ibis, I lose a lot of the type safety.

eg if I have an enum in the database with values 'LOW', 'MEDIUM', and 'HIGH', then I can (and accidentally have) done something like t.filter(t.priority == 'L'), when I really meant to do 'LOW'. But this silently gives me the wrong result. It would be great if that errored at expression construction time. Not as good, but still an improvement, would be an error at execution time.

The other motivation is for actually creating tables, eg if I do conn.create_table("my_table", schema) then it would be nice if it actually used the underlying enum type.

What is the motivation behind your request?

No response

Describe the solution you'd like

This might require emitting some DDL statements like CREATE YPE IF NOT EXISTS autogenerated_enum_name_123 AS ENUM ('val1', 'val2', 'val3'). I think we already do this for struct types on postgres, so maybe this wouldn't be a huge lift.

These are some examples of what I'd want to be able to do:

import enum
import ibis
from ibis.expr import datatypes as dt

class Priority(enum.StrEnum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"

t = ibis.table(schema={"priority": Priority})
t = ibis.table(schema={"priority": ibis.dtype(Priority)})
t = ibis.table(schema={"priority": dt.Enum(Priority)})
t = ibis.table(schema={"priority": "enum<LOW, MEDIUM, HIGH>"})

t.filter(t.priority == Priority.LOW) # This is the encouraged way because of better IDE completion and refactoring
t.filter(t.priority == "LOW")
t.filter(t.priority == "L") # error
t.priority.lower()  # error, in general you can't treat them as strings

duckdb supports ordering enum values, which would allow us to do table.priority > 'LOW', but for now, I would say we should punt on actually implementing this, but if it's not too hard, leaving the door open for us to implement this later (eg storing the enum members in order in the datatype instead of storing them in an eg frozenset)

What version of ibis are you running?

main

What backend(s) are you using, if any?

No response

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

NickCrews avatar Mar 13 '25 00:03 NickCrews

I would also love to have support for enum types, mainly for performance reasons. My understanding is that duckdb operations can be much more performant on enum datatypes, and support would make sense given that duckdb is the default backend.

enum support would also let users "roundtrip" categorical data types, preserving data structure between pandas, R, and parquet.

lboller-pwbm avatar Mar 18 '25 21:03 lboller-pwbm

A good first step would be to allow reading enums from all backends as strings. I thought this already worked, since it does on duckdb

Details

import duckdb
import ibis

con = duckdb.connect()
con.sql("CREATE TYPE my_enum AS ENUM ('a', 'b', 'c')")
con.sql("CREATE TABLE my_table (id INTEGER, my_enum_col my_enum)")
con.sql("INSERT INTO my_table VALUES (1, 'a'), (2, 'b'), (3, 'c')")
be = ibis.duckdb.from_connection(con)
be.table("my_table")
# ┏━━━━━━━┳━━━━━━━━━━━━━┓
# ┃ id    ┃ my_enum_col ┃
# ┡━━━━━━━╇━━━━━━━━━━━━━┩
# │ int32 │ string      │
# ├───────┼─────────────┤
# │     1 │ a           │
# │     2 │ b           │
# │     3 │ c           │
# └───────┴─────────────┘

but I just began querying a postgres database that has some enums in it, and the ibis datatype I get is unknown.

NickCrews avatar Mar 21 '25 09:03 NickCrews

Enum support would be helpful to avoid data loss of categories when using ibis-> pandas in my machine learning workflow.

glilienthal avatar Sep 05 '25 17:09 glilienthal

@glilienthal can you elaborate more on your workflow?

  • what backend are you using?
  • Are you reading native enums from that backend that already exist in that backend (eg someone already previously executed "CREATE TYPE ENUM MY_TYPE..." in the duckdb backend) and we need to be able to infer this, or you want to create this enum on the fly in the backend because it shouldn't exist there long-term.
  • Do you require ordering semantics on the ibis side and/or on the returned pandas dtype?

NickCrews avatar Sep 06 '25 16:09 NickCrews

I'm thinking that to implement this on the backend, we need to think more about if we want to reuse the existing enum type in the backend, or create a ephemeral enum type for every execution. Eg if you are reading some native column in duckdb with an enum type MYDB.MYSCHEMA.MYTYPE, do we reuse/reference that throughout the query? I think if we did this, then in the Datatype instance we would actually need to hold a reference to what the dtype is called in each individual backend connection. This is different from all other ibis dtypes, which are "stateless"/backend-agnostic. So I think that is a bad idea. Instead the ibis dtype should just hold the enum members, eg LOW, MEDIUM, HIGH, and then as a pre-execute hook it creates this temp enum type, then it should be possible to cast the source columns to this type as a first step, use this temp type throughout the query, and then drop the enum as a post-execute hook?

ugh, but then that would be not great for if you create a table from that query. eg (maybe not valid SQL)

con.raw_sql("CREATE TABLE src (my_col MY_NATIVE_ENUM)")
t = con.table("t")
renamed = t.select(new_col="my_col")
con.create_table("dst", renamed)

After this I would hope that the type of new_col in the dst table is the exact same as the my_col in the src table. Not sure what to do about this

NickCrews avatar Sep 06 '25 16:09 NickCrews

HI @NickCrews Thank you for coming back & sorry for missing this initially.

@glilienthal can you elaborate more on your workflow?

* what backend are you using?

duckdb

* Are you reading native enums from that backend that already exist in that backend (eg someone already previously executed "CREATE TYPE ENUM MY_TYPE..." in the duckdb backend) and we need to be able to infer this, or you want to create this enum on the fly in the backend because it shouldn't exist there long-term.

I want to read existing enums in the backend and have them treated as enums in ibis. (and pandas I export too.)

I'd like enums/categories a) because they are faster than strings, b) because they need less space than strings, c) they are needed for decisiontrees, d) they make a difference for some plotting functiosn

* Do you require ordering semantics on the ibis side and/or on the returned pandas dtype?

yes, ordering is important (e.g. for plotting) especially on the pandas.

glilienthal avatar Oct 29 '25 09:10 glilienthal