pandera
Less strict numerical type
Is there any type that represents a numerical column (includes int, float etc.)?
Currently there's no way of specifying a "number" column, since pandera adheres to pandas data types (and Python in general doesn't have a generic number type), although with @jeffzi's work on #369 you could make custom datatypes like this.
for now I'd recommend specifying a float since floats are a superset of integers.
oh, I guess another way of doing this would be to specify pandas_dtype = None (the default) and then use a Check to validate a number type:
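import pandas as pd
import pandera as pa
from pandas.api.types import is_number

# Element-wise check: apply pandas' is_number to every value in the column.
# The Check gets its own name so it doesn't shadow the imported is_number.
is_number_check = pa.Check(lambda s: s.map(is_number), name="is_number")
schema = pa.DataFrameSchema({
    "column": pa.Column(checks=is_number_check)
})
schema(pd.DataFrame({"column": [1, 2, "a"]}))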
# Output
SchemaError: <Schema Column(name=column, type=None)> failed element-wise validator 0:
<Check is_number>
failure cases:
   index failure_case
0      2            a
although with @jeffzi's work on #369 you could make custom datatypes like this.
We could even have a built-in Number dtype. Coercion would output floats or ints depending on the actual values (same as pandas.to_numeric).
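For reference, here is a small sketch of the pandas.to_numeric behavior mentioned above. It only illustrates how pandas picks the output dtype from the values; it is not how a pandera Number dtype is or would be implemented, and the sample strings are made up.

```python
import pandas as pd

# Hypothetical string inputs, chosen only to show the resulting dtypes.
all_ints = pd.to_numeric(pd.Series(["1", "2", "3"]))
with_float = pd.to_numeric(pd.Series(["1", "2.5", "3"]))

print(all_ints.dtype)    # integer dtype: every value parses as an int
print(with_float.dtype)  # float dtype: one value has a fractional part
```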
I think we should add a built-in Number type that includes all kinds of integers and floats, because we have huge datasets and element-wise checks with mapping would not perform well. @cosmicBboy
I think we should add a built-in Number type that includes all kinds of integers and floats
The higher-level data types are still TBD, but Number will most likely be one of them
In the meantime, the more performant thing to do would be:
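import pandas as pd
import pandera as pa
from pandas.api.types import is_numeric_dtype

# Series-level check: inspects the column dtype once, no per-element mapping.
is_number = pa.Check(is_numeric_dtype, name="is_number")
schema = pa.DataFrameSchema({"column": pa.Column(checks=is_number)})
schema(pd.DataFrame({"column": [1, 2, "a"]}))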
# Output
SchemaError: <Schema Column(name=column, type=None)> failed series validator 0:
<Check is_number>
Note that it won't be as informative an error message (no indication of which element caused the check to fail).
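If the missing failure cases matter, one possible workaround is to run the cheap dtype check first and fall back to the element-wise scan only when it fails. This is a plain pandas sketch, not a pandera feature, and the `failure_cases` name is just illustrative.

```python
import pandas as pd
from pandas.api.types import is_number, is_numeric_dtype

s = pd.Series([1, 2, "a"], name="column")

if is_numeric_dtype(s):
    # Fast path: a single dtype lookup, nothing to report.
    failure_cases = s.iloc[0:0]
else:
    # Slow path, run only on failure: pick out the non-numeric elements.
    numeric_mask = s.map(is_number).astype(bool)
    failure_cases = s[~numeric_mask]

print(failure_cases)
```

On a numeric column this costs only the dtype lookup, so the per-element work is paid only when there is something to report.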
@cosmicBboy I propose adding the enhancement tag to this issue.
adjusted the tags, PR is welcome after the fix for #369 is done
@cosmicBboy If adding a Number type will take time, could you add a built-in check that is serializable and suitable for data synthesis?
hey @quancore you can register checks into the pa.Check namespace with the extensions API. I'd recommend doing that, as I don't think it makes sense to temporarily add a built-in check for this type if there will be a first-class representation of it in the new type system.
Let me know if you need any help with the strategy implementation!
After #369 and #559 what is the preferred solution here? Still https://github.com/pandera-dev/pandera/issues/466#issuecomment-824062087 or https://github.com/pandera-dev/pandera/issues/466#issuecomment-824834333?
The second (https://github.com/unionai-oss/pandera/issues/466#issuecomment-824834333) seems most efficient as it uses is_numeric_dtype (no element-wise check)