[protocol] Add a ColumnSchema abstraction?

Open pitrou opened this issue 2 years ago • 0 comments

Instead of having individual methods to query the DType, categorical description, null description and metadata (which I suspect might be replicated at the DataFrame level?), how about adding a first-class abstraction to tie them together? For example:


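from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple, TypedDict

# DType, CategoricalDescription and ColumnNullType are the types already defined by the spec.

class ColumnSchema(TypedDict):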
    # the underlying physical representation
    dtype: DType
    # if the column is categorical, describes how to interpret the contents
    categorical_encoding: Optional[CategoricalDescription]
    # if the column supports null values, describes how they are represented
    null_encoding: Optional[Tuple[ColumnNullType, Any]]
    # arbitrary metadata attached to the column, possibly empty
    metadata: Dict[str, Any]

class Column(ABC):
    ...
    @property
    @abstractmethod
    def schema(self) -> ColumnSchema: ...

(IMHO, "encoding" sounds more precise than "description")
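To make the proposal concrete, here is a minimal sketch of how such a schema could be assembled from the existing per-column accessors. Several things here are assumptions for illustration, not part of the spec: the helper `schema_from_column` and the `ToyColumn` class are hypothetical, the type definitions are simplified stand-ins for the spec's own, and the sketch assumes the accessors are named `dtype`, `describe_categorical`, `describe_null` and `metadata`, with `describe_categorical` raising `TypeError` on non-categorical columns.

```python
from enum import IntEnum
from typing import Any, Dict, Optional, Tuple, TypedDict


class ColumnNullType(IntEnum):
    # Stand-in for the spec's enum; only the members used below are listed.
    NON_NULLABLE = 0
    USE_SENTINEL = 2


# Simplified stand-ins for the spec's types (assumptions for this sketch).
DType = Tuple[int, int, str, str]  # (kind, bit width, format string, endianness)


class CategoricalDescription(TypedDict):
    is_ordered: bool
    is_dictionary: bool
    categories: Any


class ColumnSchema(TypedDict):
    dtype: DType
    categorical_encoding: Optional[CategoricalDescription]
    null_encoding: Optional[Tuple[ColumnNullType, Any]]
    metadata: Dict[str, Any]


def schema_from_column(col: Any) -> ColumnSchema:
    """Hypothetical helper: gather the separate accessors into one schema."""
    try:
        categorical = col.describe_categorical
    except TypeError:
        # Assumed behaviour: non-categorical columns raise TypeError.
        categorical = None
    kind, value = col.describe_null
    # Columns that cannot hold nulls get no null encoding at all.
    null_encoding = None if kind == ColumnNullType.NON_NULLABLE else (kind, value)
    return ColumnSchema(
        dtype=col.dtype,
        categorical_encoding=categorical,
        null_encoding=null_encoding,
        metadata=dict(col.metadata),
    )


class ToyColumn:
    """Hypothetical 64-bit integer column that uses -1 as its null sentinel."""

    dtype: DType = (0, 64, "l", "=")
    metadata: Dict[str, Any] = {}

    @property
    def describe_null(self) -> Tuple[ColumnNullType, Any]:
        return (ColumnNullType.USE_SENTINEL, -1)

    @property
    def describe_categorical(self) -> CategoricalDescription:
        raise TypeError("not a categorical column")


schema = schema_from_column(ToyColumn())
print(schema)
```

A consumer would then read a single `schema` value instead of calling four accessors, and the two `Optional` fields make "not categorical" and "not nullable" explicit.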

I'm also not sure why the spec uses a mix of Tuples and TypedDicts. Is it an attempt at optimizing Python object footprint?
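For what it's worth, the footprint difference is easy to check. The sketch below compares a two-element tuple with the equivalent TypedDict instance (which is a plain `dict` at runtime). `NullEncodingDict` is a hypothetical name used only for this comparison, and `sys.getsizeof` reports the container's own size on CPython, not the objects it references.

```python
import sys
from typing import Any, Tuple, TypedDict


class NullEncodingDict(TypedDict):
    # Hypothetical dict-shaped equivalent of Tuple[ColumnNullType, Any].
    kind: int
    value: Any


as_tuple: Tuple[int, Any] = (2, -1)
as_dict = NullEncodingDict(kind=2, value=-1)

# A TypedDict adds no runtime class of its own: instances are ordinary dicts.
print(type(as_dict) is dict)

# Shallow sizes in bytes; exact numbers vary by Python version and platform.
tuple_size = sys.getsizeof(as_tuple)
dict_size = sys.getsizeof(as_dict)
print(tuple_size, dict_size)
```

On CPython the tuple is the smaller of the two, but whether that matters for a handful of descriptive objects per column is a separate question.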

pitrou, Sep 05 '23 10:09