
unable to use error="Custom Error" in in_range function

Open vivek89007 opened this issue 3 years ago • 10 comments

unable to use error="Custom Error" in in_range function

vivek89007 avatar Jul 28 '21 12:07 vivek89007

@cosmicBboy Can you please help me resolve this?

vivek89007 avatar Jul 28 '21 12:07 vivek89007

This behavior is intentional, as the built-in checks already define this argument: https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L747

Can you please explain your use case?

cosmicBboy avatar Jul 29 '21 02:07 cosmicBboy

Hi @cosmicBboy,

We have a feature that allows our end users to upload a (big) CSV file so that we can analyze its data and return some metrics to them. However, we need to validate that the CSV file and its data are valid. We use Pandera for that.

The happy path is that the user will upload a perfect file and we will be able to compute all metrics from that.

On the not-so-happy path, the user uploads a file in which some items have invalid or unexpected values, and we want to tell them exactly what was wrong with the input file in the most user-friendly way possible. To achieve this, we manipulate the pa.errors.SchemaErrors exceptions, and being able to set a custom error message would make our lives a lot easier!
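For readers with a similar use case, here is a rough sketch of how failure rows could be turned into user-facing messages. It is not the code from this project. It assumes the rows come from something like `exc.failure_cases.to_dict("records")` on a `SchemaErrors` raised by `schema.validate(df, lazy=True)`, and that each row has `column`, `check`, `failure_case` and `index` keys. The `FRIENDLY` table, `DEFAULT` template and `friendly_messages` helper are hypothetical names made up for illustration.

```python
# Hypothetical templates keyed by check name; adjust to your own checks.
FRIENDLY = {
    "not_nullable": "Row {index}: column '{column}' must not be empty.",
}

# Fallback used when no specific template is registered for a check.
DEFAULT = (
    "Row {index}: column '{column}' has an invalid value "
    "{failure_case!r} ({check})."
)


def friendly_messages(failure_rows):
    """Turn failure-case rows (plain dicts) into user-facing messages.

    Assumption: each row has 'column', 'check', 'failure_case' and
    'index' keys, as produced by something like
    ``exc.failure_cases.to_dict("records")``.
    """
    messages = []
    for row in failure_rows:
        template = FRIENDLY.get(row["check"], DEFAULT)
        messages.append(template.format(**row))
    return messages


# Intended use (commented out so the sketch runs without pandera):
# try:
#     schema.validate(df, lazy=True)
# except pa.errors.SchemaErrors as exc:
#     messages = friendly_messages(exc.failure_cases.to_dict("records"))
```

With this shape, the `check` value is whatever string the check reports, which is why being able to set `error=` on built-in checks would help.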

LMK if you want a more detailed explanation of our use-case or need help with this change =D

lcbm avatar Aug 23 '21 18:08 lcbm

Hi @lcbm, are you using custom checks or built-in checks?

For custom checks you can pass in a string to the error argument. For built-in checks this is currently not possible, but it would be a relatively easy fix to support an override for all built-in checks:

https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L524-L529

        return cls(
            _equal,
            name=cls.equal_to.__name__,
            error=f"equal_to({value})",
            **kwargs,
        )

can be refactored to

        if "error" not in kwargs:
            kwargs["error"] = f"equal_to({value})"
        return cls(
            _equal,
            name=cls.equal_to.__name__,
            **kwargs,
        )

A PR for this solution would be very much appreciated! @lcbm let me know if you want to give it a shot and I can help you through the contribution process.
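To make the proposed refactor concrete, here is a minimal, self-contained sketch of the pattern. The `Check` class below is a toy stand-in written for illustration, not pandera's actual implementation. It uses `kwargs.setdefault`, which should behave the same as the `if "error" not in kwargs` guard. Without such a guard, passing `error=` to the factory would supply the keyword twice and Python would raise a `TypeError`.

```python
class Check:
    """Toy stand-in for illustration only; not pandera's real Check."""

    def __init__(self, check_fn, name=None, error=None):
        self.check_fn = check_fn
        self.name = name
        self.error = error

    @classmethod
    def equal_to(cls, value, **kwargs):
        def _equal(data):
            return data == value

        # Fill in the default message only if the caller gave none,
        # so a user-supplied error= is passed through untouched.
        kwargs.setdefault("error", f"equal_to({value})")
        return cls(_equal, name=cls.equal_to.__name__, **kwargs)


# Default message is kept when no override is given.
default_check = Check.equal_to(5)

# A caller-supplied message replaces the default.
custom_check = Check.equal_to(5, error="value must be exactly 5")
```

The same guard would presumably need to be repeated in (or factored out of) every built-in check factory, not just `equal_to`.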

cosmicBboy avatar Aug 24 '21 01:08 cosmicBboy

Hi @cosmicBboy, we are using custom checks because of this limitation but we would prefer using built-ins (for example, in the pandera.Check.str_matches case) :smile:

Indeed, it seems like an easy fix. I will definitely take a look at it, either today after my working hours or first thing tomorrow morning. Either way, I may open a draft MR and @ you then, or LMK if you'd rather proceed some other way :grimacing:

Thanks for the quick response and giving me the opportunity to contribute to the project :rocket:

lcbm avatar Aug 24 '21 11:08 lcbm

Hi @lcbm and @cosmicBboy , I have a similar use case - is this change in progress atm?

telferm57 avatar Mar 23 '22 21:03 telferm57

Hi @telferm57, I actually didn't have the time to take this on... at the time I was hoping to use some of my working hours to contribute, but we had so many high-priority stories to work on that we chose to use custom checks instead :disappointed:

However, I did take some time to look at the code, with @cosmicBboy's suggestion in mind, and indeed it's a relatively simple change. If you decide to take this on, I could help with code review and discussions (I have more free time now; I was doing my bachelor's degree and working a full-time job back then) :smile:

Otherwise, I'll leave an example of what we decided to do instead:

schema = pa.DataFrameSchema({ 
    # ...
    consts.DATE: pa.Column(
        dtype="string",
        required=True,
        nullable=False,
        checks=[
            pa.Check(
                check_fn=lambda s: s.str.match(consts.DATE_REGEX),
                error=f"hour must match {consts.DATE_EXPECTED_FORMAT} format",
            )
        ],
    ),
})

Just an FYI, we faced an issue where the check failed to apply because our check function expected a pandas.Series (hence s.str.match) but it seems that, at times, it received a single value instead (@cosmicBboy LMK if I should file a bug, I can try to reproduce it again). We did not debug as much as we would have liked, due to other priorities (:sweat_smile:), but this was our temporary fix:

def _starts_with(value, prefix):
    """Checks if ``value`` starts with ``prefix``. This function is necessary because 
    Pandera will not always evaluate the value applied to each check as a ``pandas.Series``.
    """
    if isinstance(value, pd.Series):
        return value.str.startswith(prefix)

    if isinstance(value, str):
        return value.startswith(prefix)

    raise TypeError("'value' must be of either type 'pandas.Series' or 'str'")


# ...
schema = pa.DataFrameSchema({ 
    consts.MACHINE: pa.Column(
        dtype="string",
        required=True,
        nullable=False,
        checks=[
            pa.Check(
                check_fn=lambda s: _starts_with(s, tuple(consts.VALID_MACHINE_PREFIXES)),
                error=f"must start with one of {consts.VALID_MACHINE_PREFIXES}",
            ),
        ],
    ),
})

lcbm avatar Mar 24 '22 14:03 lcbm

Hi, thanks for the helpful post, ...I will try to take this change on over the next couple of weeks

telferm57 avatar Mar 26 '22 14:03 telferm57

@cosmicBboy, I've forked the repo and am having issues running the pyspark tests under Windows. To do this change, is it only the core test suite I need to worry about?

telferm57 avatar Apr 06 '22 20:04 telferm57

hi @telferm57 yes I think for this issue the core test suite would be the main thing to run

so you can just do pytest tests/core for now (for safety you can also try running it for the other extras) but once you make a PR CI/CD should be able to catch any unforeseen issues with the extras.

> having issues with running pyspark test under windows

yeah, so pyspark for windows is currently untested: https://github.com/pandera-dev/pandera/blob/master/.github/workflows/ci-tests.yml#L174-L176

If it's not too much trouble, would you mind opening a bug report with what you're seeing? That way, a Windows user might be able to fix the issue.

cosmicBboy avatar Apr 07 '22 03:04 cosmicBboy