pandera
Name of single index is None in SchemaModel
Love the work this library is enabling!
Describe the bug
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
from pandera import SchemaModel
from pandera.typing import Index, Series, String
# Contrived schema with single index, 1 column
class MySchema(SchemaModel):
    # Index
    foo: Index[String]
    # Column(s)
    bar: Series[String]
# Attempt to see name of index
MySchema.to_schema().index
<Schema Index(name=None, type=DataType(str))>
The names are kept when using a multi-index, but not when a single index is specified, as above.
Expected behavior
The name attribute of the index should be foo in the above example.
Desktop (please complete the following information):
- OS: Debian GNU/Linux 11 (bullseye)
- Browser: chrome
Workaround
There's gotta be a better way to do this, but here's my hacky way to get this to work for now:
from pandera import SchemaModel as PanderaSchemaModel
from pandera.typing import String, Index, Series
class SchemaModel(PanderaSchemaModel):
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Populate cls.__schema__
        cls.to_schema()
        if (
            (index := getattr(cls.__schema__, "index", None)) and
            (index.name is None)
        ):
            # Find name of index, assuming it is the only name from the list
            # of fields that is not present in columns
            for field in cls.__fields__:
                if field not in cls.__schema__.columns:
                    cls.__schema__.index._name = field
                    break
# Contrived schema
class MySchema(SchemaModel):
    # Index
    foo: Index[String]
    # Column(s)
    bar: Series[String]
# Attempt to see name of index
MySchema.to_schema().index
<Schema Index(name=foo, type=DataType(str))>
I can verify this issue with pandera 0.11.0. Pretty annoying. Besides that: awesome package!
This needs to be documented better, but you need to supply the check_name=True argument to pa.Field in order to preserve single-index schema metadata when calling to_schema.
See example here
The API reference has a more complete description: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model_components.Field.html#pandera.model_components.Field
check_name (Optional[bool]) – Whether to check the name of the column/index during validation. None is the default behavior, which translates to True for columns and multi-index, and to False for a single index.
This is the default behavior because single-index dataframes are often not named, and there's no way to have an un-named index in SchemaModels. This caused an issue where validation would fail, since SchemaModels with indexes would try to validate some index name (e.g. foo in the issue description), see #326.
Hence the check_name=None arg behaves differently depending on whether the schema has a single index or a multi-index.
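To make the rule above concrete, here is a small pure-Python sketch of how the check_name default resolves. The helper names resolve_check_name and schema_index_name are hypothetical and only illustrate the documented behavior; this is not pandera's actual implementation.

```python
from typing import Optional


def resolve_check_name(check_name: Optional[bool], is_single_index: bool) -> bool:
    """Hypothetical helper: mirror the documented default for check_name."""
    if check_name is not None:
        # In this sketch, an explicit True/False is used as given
        return check_name
    # None: True for columns and multi-index levels, False for a single index
    return not is_single_index


def schema_index_name(field_name: str, check_name: Optional[bool],
                      is_single_index: bool) -> Optional[str]:
    """Hypothetical helper: the name the resulting schema index would carry."""
    if resolve_check_name(check_name, is_single_index):
        return field_name
    return None


# Single index with the default: the name is dropped, as in the issue
print(schema_index_name("foo", None, is_single_index=True))
# Single index with check_name=True: the name is kept
print(schema_index_name("foo", True, is_single_index=True))
# Multi-index level with the default: the name is kept
print(schema_index_name("foo", None, is_single_index=False))
```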
I can verify this issue with pandera 0.11.0. Pretty annoying.
Any chance you want to channel that energy to a PR with an example in the docs somewhere on this page @hoffch ?? 😀
@cosmicBboy, thank you for the detailed explanation on this. I won't be able to get to it right away, but I can submit a PR with an example in the docs.
@cosmicBboy Thanks for the clarification, the rationale is pretty convincing. Unfortunately, I can't contribute a PR in the foreseeable future. Double thanks to @the-matt-morris for doing so instead of me!