
Schema and known column names

kopp opened this issue 2 years ago • 4 comments

We have the convention to create a class that holds the names of the columns of a DataFrame, e.g.

class ColumnNames:
  height = "height"
  weight = "weight"

CN = ColumnNames

df = pd.DataFrame({CN.height: [1.9, 1.7, 2.1], CN.weight: [33, 40, 41]})

If everyone now uses the column names from the class rather than raw strings, we (a) get suggestions from the IDE for which columns to expect (by simply typing CN.<TAB> etc., provided you use df[CN.weight] rather than df.weight) and (b) can rename columns in one place.
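To make point (b) concrete, here is a small sketch: renaming a column then only requires editing the class, and every usage follows automatically (`height_m` is just an illustrative new name, not anything from the library):

```python
import pandas as pd


class ColumnNames:
    height = "height_m"  # renamed in one place; every CN.height usage follows
    weight = "weight"


CN = ColumnNames

df = pd.DataFrame({CN.height: [1.9, 1.7, 2.1], CN.weight: [33, 40, 41]})

# All call sites that used CN.height now refer to the new column name.
assert list(df.columns) == ["height_m", "weight"]
```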

I am wondering whether we can either use a schema to get the column names, or define the schema using column names.

If we look at the getting-started example, you see that "id" appears twice: once in the definition of the Schema and once as a raw string in the DataFrame being created.

from strictly_typed_pandas import DataSet

class Schema:
    id: int
    name: str

def foo(df: DataSet[Schema]) -> DataSet[Schema]:
    # do stuff
    return df

df = DataSet[Schema]({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})

I guess I would like to be able to change the last line to something like

CN = Schema.names

df = DataSet[Schema]({CN.id: [1, 2, 3], CN.name: ["John", "Jane", "Jack"]})

or even directly use the names

df = DataSet[Schema]({Schema.id: [1, 2, 3], Schema.name: ["John", "Jane", "Jack"]})

Alternatively, it would be interesting to be able to generate a Schema but in a "functional style" (like you can do with an enum):

Schema = DataSchema("Schema",  {"id": int, "name": str})
# or in my case
Schema = DataSchema("Schema",  {CN.id: int, CN.name: str})
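For illustration, such a functional-style factory could be sketched with the built-in type() constructor, much like Enum's functional API. Note that DataSchema is not part of strictly_typed_pandas; the name and signature here are assumptions:

```python
# Hypothetical sketch of a functional-style schema factory.
# It builds a class whose __annotations__ mirror the given fields,
# as if the schema had been written with class syntax.


def DataSchema(name, fields):
    """Create a schema class from a mapping of column name -> dtype."""
    return type(name, (), {"__annotations__": dict(fields)})


Schema = DataSchema("Schema", {"id": int, "name": str})

assert Schema.__name__ == "Schema"
assert Schema.__annotations__ == {"id": int, "name": str}
```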

Is such a feature available and did I just overlook it? If not, do you think it would be a reasonable thing to provide?

kopp avatar Mar 13 '23 10:03 kopp

Nice! I've actually been thinking along the same lines.

In my current job, I'm working on a similar package right now for adding schemas to pyspark. One of the things it supports is exactly what you describe: using the column names from the schemas, to support auto-complete and easy refactoring.

It's not open source yet, but once it is, we should look into transferring certain functionality to strictly_typed_pandas as well.

Until then, the way I approached it (roughly):

  • Make all schemas import a Schema class.
    • Every schema someone defines needs to subclass this Schema
    • Backward compatibility: if people don't subclass their schemas from Schema, that's fine, but they'll miss out on all the nice features we're gonna add here.
  • Make a meta-class MetaSchema, add this to Schema
    • Schema will be the public interface, because subclassing is just simpler than meta-classing
  • In MetaSchema, add a definition to __getattribute__(), in which we return a string every time one of its attributes is accessed.

My time right now is a bit limited, I'd like to focus on open-sourcing the other package first. If you're willing to make a contribution, I can share a bit more of the code that we've used to do the above? Otherwise, I'll get back to it at a later point.

nanne-aben avatar Mar 13 '23 12:03 nanne-aben

Hi, the following code allows for that. Do we think that this is "it" already?

import pandas as pd
from typing import List, Type


class _SchemaMeta(type):
    """
    Metaclass for Schemas.
    """

    def __new__(metacls, cls, bases, classdict):
        # create the actual class
        schema_class = super().__new__(metacls, cls, bases, classdict)
        schema_class._member_names_ = list(classdict.get("__annotations__", {}).keys())
        return schema_class

    def __getattr__(cls, name):
        """
        Return the name of the member matching `name` if it had some annotation (i.e. is usable as schema) on it.
        """
        if name in cls._member_names_:
            return name
        else:
            raise AttributeError(
                f"Unknown attribute '{name}', please use one of {cls._member_names_}."
            )


class Schema(metaclass=_SchemaMeta):
    """
    Generic schema.

    Derive from this class to define new schemas.
    """

    pass


def get_columns(schema: Type[Schema]) -> List[str]:
    return schema._member_names_


class Employee(Schema):
    id: int
    name: str


df = pd.DataFrame({Employee.id: [1, 2, 3], Employee.name: ["John", "Jack", "Alfred"]})

assert list(df.columns) == get_columns(Employee)
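One nice side effect of the `__getattr__` guard in the metaclass: a typo in a column name fails immediately with a helpful message, rather than silently creating a misnamed column. A quick standalone check (re-declaring the classes from the snippet above so it runs on its own):

```python
class _SchemaMeta(type):
    def __new__(metacls, cls, bases, classdict):
        schema_class = super().__new__(metacls, cls, bases, classdict)
        schema_class._member_names_ = list(classdict.get("__annotations__", {}).keys())
        return schema_class

    def __getattr__(cls, name):
        # Only fires when normal attribute lookup fails, i.e. for
        # annotation-only names like `id` and `name` -- or for typos.
        if name in cls._member_names_:
            return name
        raise AttributeError(
            f"Unknown attribute '{name}', please use one of {cls._member_names_}."
        )


class Schema(metaclass=_SchemaMeta):
    pass


class Employee(Schema):
    id: int
    name: str


# Annotated names resolve to their own string ...
assert Employee.id == "id"
assert Employee.name == "name"

# ... while a typo is caught at access time instead of at runtime in pandas.
try:
    Employee.naem
except AttributeError as e:
    print(e)  # Unknown attribute 'naem', please use one of ['id', 'name'].
```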

kopp avatar Mar 14 '23 10:03 kopp

I am wondering whether we can go one step further: If we create a

DataSet[Employee]({Employee.id: [1, 2, 3], Employee.name: ["John", "Jack", "Alfred"]})

it would just be awesome if we could check that this dict is correctly typed. What I mean by that is: if the __init__ method of DataSet[Employee] could be typed to use a TypedDict, then mypy would warn us if we did something like

DataSet[Employee]({Employee.id: [1, 2, 3], Employee.name: [1, 2, 3]})

where we right now only get a runtime error.

I tried to do that via the following

from __future__ import annotations

import pandas as pd
from typing import List, Type, TypeVar, TypedDict, Iterable, Dict, Generic


class _SchemaMeta(type):
    """
    Metaclass for Schemas.
    """

    def __new__(metacls, cls, bases, classdict):
        # create the actual class
        schema_class = super().__new__(metacls, cls, bases, classdict)
        schema_class._member_names_ = list(classdict.get("__annotations__", {}).keys())
        types: Dict[str, Type] = {
            name: Iterable[type_]
            for name, type_ in classdict.get("__annotations__", {}).items()
        }
        schema_class._expected_input_ = TypedDict("ExpectedInput", types)
        return schema_class

    def __getattr__(cls, name):
        """
        Return the name of the member matching `name` if it had some annotation (i.e. is usable as schema) on it.
        """
        if name in cls._member_names_:
            return name
        else:
            raise AttributeError(
                f"Unknown attribute '{name}', please use one of {cls._member_names_}."
            )


class Schema(metaclass=_SchemaMeta):
    """
    Generic schema.

    Derive from this class to define new schemas.
    """

    pass


def get_columns(schema: Type[Schema]) -> List[str]:
    return schema._member_names_


S = TypeVar("S", bound=Schema)


class DataSet(Generic[S]):
    def __init__(self, dataframe: pd.DataFrame):
        self.dataframe = dataframe

    @staticmethod
    def from_dict(dict_data: S._expected_input_) -> DataSet[S]:
        return DataSet[S](pd.DataFrame(dict_data))


class Employee(Schema):
    id: int
    name: str


ds = DataSet[Employee].from_dict(
    {Employee.id: [1, 2, 3], Employee.name: ["John", "Jack", "Alfred"]}
)

but unfortunately mypy complains with Name "S._expected_input_" is not defined.

What do you think about this idea and do you have a pointer for me to get this fixed?

kopp avatar Mar 14 '23 13:03 kopp

W.r.t. your first post: yes, that's it! If you make a PR we can add it in.

W.r.t. the second post: I like the idea! But yeah, I've thought about it too, and never quite figured out how to do it. S._expected_input_ is only defined at run-time, so mypy will never see it. You could define Schema to be a subclass of TypedDict (which doesn't really play nice with having a meta-class, but let's entertain it for a second), but then you'd need to define every attribute in the schema with List[]:

class Employee(Schema):
    id: List[int]
    name: List[str]

Which reads a bit funny to me.
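For completeness, a minimal sketch of what that TypedDict approach would look like end to end. The names EmployeeInput and from_dict are hypothetical, and this drops the metaclass entirely; its only point is that mypy can now statically check the dict, at the cost of the List[...] annotations:

```python
from __future__ import annotations

from typing import List, TypedDict

import pandas as pd


class EmployeeInput(TypedDict):
    # Every column must be annotated as a list of values, which is
    # what "reads a bit funny" -- but it is what mypy needs to check.
    id: List[int]
    name: List[str]


def from_dict(data: EmployeeInput) -> pd.DataFrame:
    # mypy would flag e.g. {"id": [1], "name": [1]} as a type error here,
    # instead of it only failing at run-time.
    return pd.DataFrame(data)


df = from_dict({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
assert list(df.columns) == ["id", "name"]
```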

nanne-aben avatar Mar 14 '23 14:03 nanne-aben