unable to use error="Custom Error" in in_range function
@cosmicBboy Can you please help me resolve this?
This behavior is intentional, as the built-in checks already define this argument: https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L747
Can you please explain your use case?
Hi @cosmicBboy,
We have a feature that allows our end users to upload a (big) CSV file so that we can analyze its data and return some metrics to them. However, we first need to validate that the CSV file and its data are valid. We use Pandera for that.
The happy path is that the user will upload a perfect file and we will be able to compute all metrics from that.
In the not-so-happy path, the user uploads a file containing some items with invalid/unexpected values, and we want to tell them exactly what was wrong with their input file in the most user-friendly way possible. To achieve this, we manipulate the pa.errors.SchemaErrors exceptions, and being able to set a custom error message would make our lives a lot easier!
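For context, here is a minimal sketch of the kind of handling we do (the schema, column names, and values below are illustrative, not our real ones):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "amount": pa.Column(int, checks=pa.Check.in_range(0, 100)),
})

df = pd.DataFrame({"amount": [10, 250]})

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    # failure_cases is a DataFrame with one row per failing value; we turn it
    # into user-facing messages, and this is where a custom ``error`` string on
    # built-in checks like ``in_range`` would help.
    for _, row in exc.failure_cases.iterrows():
        print(f"column '{row['column']}': value {row['failure_case']} failed check '{row['check']}'")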
LMK if you want a more detailed explanation of our use-case or need help with this change =D
Hi @lcbm, are you using custom checks or built-in checks?
For custom checks you can pass a string to the error argument. For built-in checks this is currently not possible, but it would be a relatively easy fix to support an override for all built-in checks:
https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L524-L529
return cls(
    _equal,
    name=cls.equal_to.__name__,
    error=f"equal_to({value})",
    **kwargs,
)
can be refactored to
if "error" not in kwargs:
kwargs["error"] = f"equal_to({value})"
return cls(
_equal,
name=cls.equal_to.__name__,
**kwargs,
)
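With that change, overriding the message on a built-in check would look something like this (a hypothetical sketch, assuming the refactor above is applied to all built-ins):

import pandera as pa

# ``error`` would override the default "in_range(0, 100)" message instead of
# conflicting with the one the built-in check already sets
pa.Column(int, checks=pa.Check.in_range(0, 100, error="amount must be between 0 and 100"))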
A PR for this solution would be very much appreciated! @lcbm let me know if you want to give it a shot and I can help you through the contribution process.
Hi @cosmicBboy, we are using custom checks because of this limitation, but we would prefer using built-ins (for example, pandera.Check.str_matches) :smile:
Indeed it seems like an easy fix. I will definitely take a look at it, either today when I'm done with my working hours or tomorrow first thing in the morning. Either way, I may open a draft MR and @ you then, or LMK if you'd rather proceed some other way :grimacing:
Thanks for the quick response and giving me the opportunity to contribute to the project :rocket:
Hi @lcbm and @cosmicBboy , I have a similar use case - is this change in progress atm?
Hi @telferm57, I actually didn't have the time to take this on... at the time I was hoping to use some of my working hours to contribute, but we had so many high-priority stories to work on that we chose to use custom checks instead :disappointed:
However, I did take some time to look at the code with @cosmicBboy's suggestion in mind, and indeed it's a relatively simple change. If you decide to take this on, I could help with code review and discussions (I have more free time now; back then I was doing my bachelor's degree and working a full-time job) :smile:
Otherwise, I'll leave an example of what we decided to do instead:
schema = pa.DataFrameSchema({
    # ...
    consts.DATE: pa.Column(
        dtype="string",
        required=True,
        nullable=False,
        checks=[
            pa.Check(
                check_fn=lambda s: s.str.match(consts.DATE_REGEX),
                error=f"date must match {consts.DATE_EXPECTED_FORMAT} format",
            )
        ],
    ),
})
Just an FYI: we faced an issue where the check failed to apply because our check function expected a pandas.Series (hence s.str.match), but it seems that, at times, it received a single value instead (@cosmicBboy LMK if I should file a bug, I can try to reproduce it again). We did not debug this as much as we would have liked to, due to other priorities (:sweat_smile:), but this was our temporary fix:
import pandas as pd


def _starts_with(value, prefix):
    """Checks if ``value`` starts with ``prefix``. This function is necessary because
    Pandera will not always evaluate the value applied to each check as a ``pandas.Series``.
    """
    if isinstance(value, pd.Series):
        return value.str.startswith(prefix)
    if isinstance(value, str):
        return value.startswith(prefix)
    raise TypeError("'value' must be of either type 'pandas.Series' or 'str'")
# ...
schema = pa.DataFrameSchema({
    consts.MACHINE: pa.Column(
        dtype="string",
        required=True,
        nullable=False,
        checks=[
            pa.Check(
                check_fn=lambda s: _starts_with(s, tuple(consts.VALID_MACHINE_PREFIXES)),
                error=f"must start with one of {consts.VALID_MACHINE_PREFIXES}",
            ),
        ],
    ),
})
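And just as a quick sanity check of the helper itself (the prefixes and values here are made up), it handles both of the input types we observed; note that passing a tuple to Series.str.startswith needs a fairly recent pandas version:

prefixes = ("mx-", "srv-")
print(_starts_with(pd.Series(["mx-001", "ab-002"]), prefixes))  # 0: True, 1: False
print(_starts_with("mx-001", prefixes))  # True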
Hi, thanks for the helpful post... I will try to take this change on over the next couple of weeks.
@cosmicBboy, I've forked the repo but I'm having issues with running the pyspark tests under Windows. To do this change, is it only the core test suite I need to worry about?
Hi @telferm57, yes, for this issue the core test suite would be the main thing to run, so you can just do pytest tests/core for now (for safety you can also try running it for the other extras), but once you make a PR, CI/CD should be able to catch any unforeseen issues with the extras.
having issues with running pyspark test under windows
Yeah, pyspark on Windows is currently untested: https://github.com/pandera-dev/pandera/blob/master/.github/workflows/ci-tests.yml#L174-L176
If it's not too much trouble, would you mind opening up a bug report with what you're seeing? That way a Windows user might be able to fix the issue.