Less strict numerical type

Open quancore opened this issue 4 years ago • 11 comments

Is there any type that represents a numerical column (includes int, float etc.)?

quancore avatar Apr 20 '21 15:04 quancore

currently there's no way of specifying a "number" column since right now pandera adheres to pandas data types (and also in general python doesn't have a generic number type), although with @jeffzi's work on #369 you could make custom datatypes like this.

for now I'd recommend specifying a float since floats are a superset of integers.

cosmicBboy avatar Apr 21 '21 13:04 cosmicBboy

oh, I guess another way of doing this would be to specify pandas_dtype = None (the default) and then use a Check to validate a number type:

import pandas as pd
import pandera as pa
from pandas.api.types import is_number

# Named is_number_check so it does not shadow the imported is_number.
is_number_check = pa.Check(lambda s: s.map(is_number), name="is_number")

schema = pa.DataFrameSchema({
    "column": pa.Column(checks=is_number_check)
})

schema(pd.DataFrame({"column": [1, 2, "a"]}))

# Output
SchemaError: <Schema Column(name=column, type=None)> failed element-wise validator 0:
<Check is_number>
failure cases:
   index failure_case
0      2            a

cosmicBboy avatar Apr 21 '21 13:04 cosmicBboy

although with @jeffzi's work on #369 you could make custom datatypes like this.

We could even have a built-in Number dtype. Coercion would output floats or ints depending on the actual values (same as pandas.to_numeric)

jeffzi avatar Apr 21 '21 22:04 jeffzi
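
For reference, the pandas.to_numeric behaviour that jeffzi's proposal points to can be sketched with pandas alone. This is only an illustration of how the result dtype follows the values; it is not pandera's Number dtype, which did not exist at the time of this comment, and the variable names are made up for the example.

```python
import pandas as pd

# Object-dtype columns holding plain Python numbers.
all_ints = pd.Series([1, 2, 3], dtype=object)
some_floats = pd.Series([1, 2.5, 3], dtype=object)
with_invalid = pd.Series([1, 2, "a"], dtype=object)

# to_numeric picks the result dtype from the actual values.
ints = pd.to_numeric(all_ints)
floats = pd.to_numeric(some_floats)

# errors="coerce" turns unparseable values into NaN, which forces a float result.
coerced = pd.to_numeric(with_invalid, errors="coerce")

print(ints.dtype)    # int64
print(floats.dtype)  # float64
print(coerced.dtype) # float64
```

Without errors="coerce", the third series would raise on "a" rather than return a result.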

although with @jeffzi's work on #369 you could make custom datatypes like this.

We could even have a built-in Number dtype. Coercion would output floats or ints depending on the actual values (same as pandas.to_numeric)

I think we should add a built-in Number type that includes all kinds of integers and floats, because we have huge datasets and element-wise checks using map would not perform well. @cosmicBboy

quancore avatar Apr 22 '21 07:04 quancore

I think we should add a built-in Number type that includes all kinds of integers and floats

The higher-level data types are still TBD, but Number will most likely be one of them

In the meantime, the more performant thing to do would be

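import pandas as pd
import pandera as pa
from pandas.api.types import is_numeric_dtype

# is_numeric_dtype inspects the column's dtype once instead of mapping over every element.
is_number = pa.Check(is_numeric_dtype, name="is_number")
schema = pa.DataFrameSchema({"column": pa.Column(checks=is_number)})
schema(pd.DataFrame({"column": [1, 2, "a"]}))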

# Output
SchemaError: <Schema Column(name=column, type=None)> failed series validator 0:
<Check is_number>

Note that the error message won't be as informative (no indication of which element caused the check to fail).

cosmicBboy avatar Apr 22 '21 13:04 cosmicBboy
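
The two approaches in this thread do not always agree, which may matter when choosing between them. The pandas-only sketch below compares the dtype-level test with the element-wise one on three small series; the helper names dtype_level and element_level are hypothetical and exist only for this comparison.

```python
import pandas as pd
from pandas.api.types import is_number, is_numeric_dtype


def dtype_level(series: pd.Series) -> bool:
    """One check against the column dtype (cheap, no per-element work)."""
    return bool(is_numeric_dtype(series))


def element_level(series: pd.Series) -> bool:
    """One check per element (slower, but looks at the actual values)."""
    return bool(series.map(is_number).all())


numeric = pd.Series([1, 2.5, 3])            # float64 column
mixed = pd.Series([1, 2, "a"])              # object column with a string
boxed = pd.Series([1, 2, 3], dtype=object)  # numbers stored as object

results = {
    name: (dtype_level(s), element_level(s))
    for name, s in {"numeric": numeric, "mixed": mixed, "boxed": boxed}.items()
}
print(results)
```

The "boxed" case is the one to watch: an object column that happens to hold only numbers fails the dtype-level check but passes the element-wise one.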

@cosmicBboy I propose to add enhancement tag to this issue.

quancore avatar Apr 27 '21 08:04 quancore

adjusted the tags, PR is welcome after the fix for #369 is done

cosmicBboy avatar May 02 '21 20:05 cosmicBboy

@cosmicBboy If adding the Number type will take time, could you add a built-in check that is serializable and suitable for data synthesis?

quancore avatar May 03 '21 21:05 quancore

hey @quancore you can register checks into the pa.Check namespace with the extensions API. I'd recommend doing that, as I don't think it makes sense to temporarily add a built-in check for this type if there will be a first-class representation of it in the new type system.

Let me know if you need any help with the strategy implementation!

cosmicBboy avatar May 04 '21 12:05 cosmicBboy

After #369 and #559 what is the preferred solution here? Still https://github.com/pandera-dev/pandera/issues/466#issuecomment-824062087 or https://github.com/pandera-dev/pandera/issues/466#issuecomment-824834333?

fleimgruber avatar Mar 17 '22 13:03 fleimgruber

The second (https://github.com/unionai-oss/pandera/issues/466#issuecomment-824834333) seems most efficient as it uses is_numeric_dtype (no element-wise check)

smarie avatar Mar 19 '24 16:03 smarie