feat: expose `highest_precedence(*dtypes)`

Open NickCrews opened this issue 1 year ago • 2 comments

Is your feature request related to a problem?

I want to be able to unify the schemas of multiple tables.

Currently I have something like

from collections.abc import Iterable, Mapping
from typing import Any, Literal

import ibis
from ibis.expr.datatypes import highest_precedence

def unify_schemas(
    schemas: Iterable[ibis.Schema | Mapping[str, Any]],
    *,
    how: Literal["error", "union", "intersection"] = "error",
    on_conflict: Literal["upcast", "error"] = "upcast",
) -> ibis.Schema:
    """Unify multiple schemas into one.

    Parameters
    ----------
    schemas
        The schemas to unify.
    how
        How to handle columns that are present in some schemas but not others.

        - "error": raise a ValueError
        - "union": keep all columns
        - "intersection": only keep columns that are in all schemas
    on_conflict
        What to do when schemas have a column with the same name, but different types.
        Options are:

        - "upcast": upcast the column to the most general type
        - "error": raise a ValueError
    """
    schemas = [ibis.schema(schema) for schema in schemas]
    column_sets = [set(schema) for schema in schemas]
    union = set().union(*column_sets)
    if how == "error":
        for schema in schemas:
            missing = union - set(schema)
            if missing:
                raise ValueError(
                    f"missing columns {missing} from schema {schema}"
                )
        out_columns = union
    elif how == "union":
        out_columns = union
    elif how == "intersection":
        out_columns = union.intersection(*column_sets)
    else:
        raise ValueError(f"unknown how: {how}")

    out_schema = {}
    errors = []
    for col in out_columns:
        types = {schema[col] for schema in schemas if col in schema}
        if on_conflict == "error":
            if len(types) > 1:
                errors.append((col, types))
                continue  # no single type to record; reported after the loop
            typ = next(iter(types))
        elif on_conflict == "upcast":
            typ = highest_precedence(types)
        else:
            raise ValueError(f"unknown on_conflict: {on_conflict}")
        out_schema[col] = typ
    if errors:
        raise ValueError(f"conflicting types: {errors}")
    return ibis.schema(out_schema)

Note that I have to import highest_precedence() from ibis.expr.datatypes, which isn't listed in the docs.
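
To make the `how` and "upcast" behaviour concrete without depending on ibis at all, here is a toy model on plain dicts of type names. Everything in it is an assumption for illustration: `RANK`, `toy_highest_precedence`, and `toy_unify` are hypothetical names, and the fixed ranking table only stands in for whatever casting rules ibis actually applies.

```python
from functools import reduce

# Hypothetical widening order, for illustration only; ibis derives this
# from its own casting rules rather than from a fixed table.
RANK = {"int8": 0, "int16": 1, "int32": 2, "int64": 3, "float64": 4}


def toy_highest_precedence(types):
    """Return the widest type name according to RANK."""
    return reduce(lambda a, b: a if RANK[a] >= RANK[b] else b, types)


def toy_unify(schemas, how="union"):
    """Unify {column: type name} mappings, upcasting on conflict."""
    column_sets = [set(s) for s in schemas]
    if how == "union":
        columns = set().union(*column_sets)
    elif how == "intersection":
        columns = set.intersection(*column_sets)
    else:
        raise ValueError(f"unknown how: {how}")
    # For each kept column, collect its type from every schema that has it
    # and reduce to the widest one.
    return {
        col: toy_highest_precedence([s[col] for s in schemas if col in s])
        for col in sorted(columns)
    }


a = {"id": "int8", "score": "float64"}
b = {"id": "int64", "name_len": "int16"}

print(toy_unify([a, b], how="union"))
# {'id': 'int64', 'name_len': 'int16', 'score': 'float64'}
print(toy_unify([a, b], how="intersection"))
# {'id': 'int64'}
```

The real function above does the same two steps (pick the columns, then reduce each column's types), just with ibis dtypes and `highest_precedence` in place of the toy ranking.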

What is the motivation behind your request?

No response

Describe the solution you'd like

Maybe DataType.highest_precedence(*others: DataType)? A top-level API like ibis.highest_dtype() would also be reasonable, but this seems like a rare enough need that I don't really want to pollute the top-level namespace with it.

What version of ibis are you running?

main

What backend(s) are you using, if any?

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

NickCrews avatar Jun 08 '24 00:06 NickCrews

Thanks for the issue!

Can you clarify what you're asking for here? Is it just to "officialize" the API?

cpcloud avatar Jun 12 '24 17:06 cpcloud

yup, just include it in the docs so that we know it is a stable(ish) API. No functional changes needed.

NickCrews avatar Jun 12 '24 18:06 NickCrews