strictly_typed_pandas
Schema and known column names
We follow the convention of creating a class that holds the column names of a DataFrame, e.g.
import pandas as pd

class ColumnNames:
    height = "height"
    weight = "weight"

CN = ColumnNames
df = pd.DataFrame({CN.height: [1.9, 1.7, 2.1], CN.weight: [33, 40, 41]})
If everyone now uses the column names from the class rather than raw strings, we (a) get IDE suggestions for the expected columns (by simply typing CN.<TAB> etc., provided you use df[CN.weight] rather than df.weight) and (b) can rename columns in one place.
I am wondering whether we can either use a schema to get the column names or define the schema using column names.
If you look at the getting-started example, you see that "id" appears twice: once in the definition of the Schema and once as a string in the DataFrame being created.
from strictly_typed_pandas import DataSet

class Schema:
    id: int
    name: str

def foo(df: DataSet[Schema]) -> DataSet[Schema]:
    # do stuff
    return df

df = DataSet[Schema]({"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]})
I guess I would like to be able to change the last line to something like
CN = Schema.names
df = DataSet[Schema]({CN.id: [1, 2, 3], CN.name: ["John", "Jane", "Jack"]})
or even directly use the names
df = DataSet[Schema]({Schema.id: [1, 2, 3], Schema.name: ["John", "Jane", "Jack"]})
Alternatively, it would be interesting to be able to generate a Schema in a "functional style" (like you can do with an enum):
Schema = DataSchema("Schema", {"id": int, "name": str})
# or in my case
Schema = DataSchema("Schema", {CN.id: int, CN.name: str})
Is such a feature available and I just overlooked it? If not, do you think it is a reasonable thing to provide?
Nice! I've actually been thinking along the same lines.
In my current job, I'm working on a similar package for adding schemas to pyspark. One of the things it supports is exactly what you describe: using the column names from the schemas to support auto-complete and easy refactoring.
It's not open source yet, but once it is, we should look into transferring certain functionality to strictly_typed_pandas as well.
Until then, the way I approached it (roughly):
- Make all schemas import a Schema class.
  - Every schema someone defines needs to subclass this Schema.
  - Backward compatibility: if people don't subclass their schemas from Schema, that's fine, but they'll miss out on all the nice features we're gonna add here.
- Make a meta-class MetaSchema and add this to Schema.
  - Schema will be the public interface, because subclassing is just simpler than meta-classing.
- In MetaSchema, add a definition of __getattribute__(), in which we return a string every time one of its attributes is accessed.
My time right now is a bit limited; I'd like to focus on open-sourcing the other package first. If you're willing to make a contribution, I can share a bit more of the code that we've used to do the above? Otherwise, I'll get back to it at a later point.
Hi, the following code allows for that. Do we think that this is "it" already?
import pandas as pd
from typing import List, Type


class _SchemaMeta(type):
    """
    Metaclass for Schemas.
    """

    def __new__(metacls, cls, bases, classdict):
        # create the actual class
        schema_class = super().__new__(metacls, cls, bases, classdict)
        schema_class._member_names_ = list(classdict.get("__annotations__", {}).keys())
        return schema_class

    def __getattr__(cls, name):
        """
        Return the name of the member matching `name` if it had some annotation (i.e. is usable as schema) on it.
        """
        if name in cls._member_names_:
            return name
        else:
            raise AttributeError(
                f"Unknown attribute '{name}', please use one of {cls._member_names_}."
            )


class Schema(metaclass=_SchemaMeta):
    """
    Generic schema.
    Derive from this class to define new schemas.
    """
    pass


def get_columns(schema: Type[Schema]) -> List[str]:
    return schema._member_names_


class Employee(Schema):
    id: int
    name: str


df = pd.DataFrame({Employee.id: [1, 2, 3], Employee.name: ["John", "Jack", "Alfred"]})
assert list(df.columns) == get_columns(Employee)
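As a quick check of the __getattr__ guard (continuing with the classes above; "salary" is just an example of a column that is not in the schema):

# Annotated columns resolve to their names...
assert Employee.id == "id"
assert Employee.name == "name"

# ...while a typo fails fast instead of silently producing a wrong column name.
try:
    Employee.salary
except AttributeError as err:
    print(err)  # Unknown attribute 'salary', please use one of ['id', 'name'].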
I am wondering whether we can go one step further: if we create a
DataSet[Employee]({Employee.id: [1, 2, 3], Employee.name: ["John", "Jack", "Alfred"]})
it would just be awesome if we could check that this dict is correctly typed. What I mean by that is: if the __init__ method of DataSet[Employee] could be typed to use a TypedDict, then mypy would warn us if we did something like
DataSet[Employee]({Employee.id: [1, 2, 3], Employee.name: [1, 2, 3]})
where we currently only get a runtime error.
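For illustration, a hand-written TypedDict gives exactly the kind of static check I mean (EmployeeInput is a hypothetical name, just to show what mypy would flag):

from typing import List, TypedDict

# Hypothetical, hand-written equivalent of the check we'd like to generate:
class EmployeeInput(TypedDict):
    id: List[int]
    name: List[str]

ok: EmployeeInput = {"id": [1, 2, 3], "name": ["John", "Jane", "Jack"]}
bad: EmployeeInput = {"id": [1, 2, 3], "name": [1, 2, 3]}  # mypy flags the ints here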
I tried to do that via the following
from __future__ import annotations

import pandas as pd
from typing import List, Type, TypeVar, TypedDict, Iterable, Dict, Generic


class _SchemaMeta(type):
    """
    Metaclass for Schemas.
    """

    def __new__(metacls, cls, bases, classdict):
        # create the actual class
        schema_class = super().__new__(metacls, cls, bases, classdict)
        schema_class._member_names_ = list(classdict.get("__annotations__", {}).keys())
        types: Dict[str, Type] = {
            name: Iterable[type_]
            for name, type_ in classdict.get("__annotations__", {}).items()
        }
        schema_class._expected_input_ = TypedDict("ExpectedInput", types)
        return schema_class

    def __getattr__(cls, name):
        """
        Return the name of the member matching `name` if it had some annotation (i.e. is usable as schema) on it.
        """
        if name in cls._member_names_:
            return name
        else:
            raise AttributeError(
                f"Unknown attribute '{name}', please use one of {cls._member_names_}."
            )


class Schema(metaclass=_SchemaMeta):
    """
    Generic schema.
    Derive from this class to define new schemas.
    """
    pass


def get_columns(schema: Type[Schema]) -> List[str]:
    return schema._member_names_


S = TypeVar("S", bound=Schema)


class DataSet(Generic[S]):
    def __init__(self, dataframe: pd.DataFrame):
        self.dataframe = dataframe

    @staticmethod
    def from_dict(dict_data: S._expected_input_) -> DataSet[S]:
        return DataSet[S](pd.DataFrame(dict_data))


class Employee(Schema):
    id: int
    name: str


ds = DataSet[Employee].from_dict(
    {Employee.id: [1, 2, 3], Employee.name: ["John", "Jack", "Alfred"]}
)
but unfortunately mypy complains with Name "S._expected_input_" is not defined.
What do you think about this idea and do you have a pointer for me to get this fixed?
W.r.t. your first post: yes, that's it! If you make a PR we can add it in.
W.r.t. the second post: I like the idea! But yeah, I've thought about it too and never quite figured out how to do it. S._expected_input_ is only defined at runtime, so mypy will never know about it. You could define Schema to be a subclass of TypedDict (which doesn't really play nice with having a meta-class, but let's entertain it for a second), but then you'd need to define every attribute in the schema with List[]:
class Employee(Schema):
    id: List[int]
    name: List[str]
Which reads a bit funny to me.