The default behavior using a reference in TestShareOfOutRangeValues

Open samuelamico opened this issue 2 years ago • 7 comments

Hello, the issue is: in the documentation (All Tests), the test TestShareOfOutRangeValues is described as having this default configuration:

With reference: the test fails if over 10% of values are out of range.

However, when we run the test:

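from evidently.test_suite import TestSuite
from evidently.tests import TestShareOfOutRangeValues

# ref, curr and schema are the reference data, current data and column mapping defined elsewhere
data_quality = TestSuite(tests=[
    TestShareOfOutRangeValues(column_name='HouseAge')
])

data_quality.run(reference_data=ref, current_data=curr, column_mapping=schema)
data_quality.as_dict()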

The result is:

'version': '0.1.58.dev0',
 'datetime': '2022-10-27T09:57:01.408264',
 'tests': [{'name': 'Share of Out-of-Range Values',
   'description': 'The share of values out of range in the column **HouseAge** is 0.0002 (1 out of 5000).  The test threshold is eq=0 ± 1e-12.',
   'status': 'FAIL',

The test does not use the reference to set the threshold. The condition is eq=0, as shown in the source code (TestShare Source Code):

class TestShareOfOutRangeValues(BaseDataQualityValueRangeMetricsTest):
    name = "Share of Out-of-Range Values"

    def get_condition(self) -> TestValueCondition:
        if self.condition.has_condition():
            return self.condition
        return TestValueCondition(eq=approx(0))

samuelamico avatar Oct 27 '22 13:10 samuelamico

Thanks a lot for raising the issue @samuelamico! It is a mistake in the documentation.

By default, the test does use the reference (to learn the reference value ranges) but expects all values in the current data to stay in this range. I added a PR to update the docs to match the current implementation: https://github.com/evidentlyai/evidently/pull/425
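As a rough illustration of the behavior described above, here is a minimal pure-Python sketch. It is not Evidently's implementation. The helper `share_out_of_range`, the sample data, and the assumption that the range is the reference minimum and maximum are all hypothetical. It reproduces the "1 out of 5000" case from the report and compares the default condition (equal to 0 within 1e-12) with the 10% threshold the documentation originally described.

```python
import math


def share_out_of_range(current, left, right):
    """Return the fraction of values in `current` that fall outside [left, right]."""
    outside = sum(1 for v in current if v < left or v > right)
    return outside / len(current)


# Hypothetical reference data; assumed here: the allowed range is its observed min..max.
reference = [1.0, 12.0, 30.0, 52.0]
left, right = min(reference), max(reference)

# 5000 current values, exactly one of which (60.0) exceeds the reference maximum.
current = [25.0] * 4999 + [60.0]

share = share_out_of_range(current, left, right)

# Default condition as shown in the report: eq=0 ± 1e-12.
passes_default = math.isclose(share, 0.0, rel_tol=0.0, abs_tol=1e-12)

# The threshold the docs originally described: fail if over 10% are out of range.
passes_ten_percent = share < 0.1

print(f"share={share:.4f}")                                   # share=0.0002
print("default:", "PASS" if passes_default else "FAIL")       # default: FAIL
print("10% rule:", "PASS" if passes_ten_percent else "FAIL")  # 10% rule: PASS
```

A single out-of-range value is enough to fail the default condition, which matches the FAIL status in the output quoted in the issue.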

elenasamuylova avatar Oct 27 '22 18:10 elenasamuylova

Hi, do you plan to make the margin configurable, i.e. so that the test fails if over XX% of values are out of range?

Thanks

anh-le-profinit avatar Jul 26 '23 13:07 anh-le-profinit

Hi @anh-le-profinit,

It is possible to configure custom conditions for all tests. Here is the documentation: https://docs.evidentlyai.com/user-guide/tests-and-reports/custom-test-suite#3.-set-test-conditions

For example, if you want the test to fail if more than 10% of values in the column "age" are out of range (with the range derived automatically from the reference dataset):

my_tests = TestSuite(tests=[
    TestShareOfOutRangeValues(column_name='age', lt=0.1),
])

If you also want to set a manual range of the feature value (for example, from 10 to 80):

my_tests = TestSuite(tests=[
    TestShareOfOutRangeValues(column_name='age', left=10, right=80, lt=0.1),
])

elenasamuylova avatar Jul 26 '23 13:07 elenasamuylova

Thanks Elena,

My mistake: my question was about similar but slightly different tests, TestColumnShareOfMissingValues and TestMostCommonValueShare.

There, a reference dataset is used to set a reference metric and the tests check whether the current metric is within a certain range. The range is now fixed at 10% around the reference, which for low reference values can be very strict. Is there a way to relax this constraint (or do you plan to introduce it in the future?)

anh-le-profinit avatar Jul 26 '23 13:07 anh-le-profinit

Hi @anh-le-profinit,

It works exactly the same - you can pass custom conditions to any Evidently Test.

For example, if you want the test to fail when the share of missing values is >= 20%, here is how you do that:

my_tests = TestSuite(tests=[
    TestColumnShareOfMissingValues(column_name='age', lt=0.2),
])

Here are the docs on the standard parameters you can use to set test conditions (lt, gt, eq, etc.): https://docs.evidentlyai.com/user-guide/tests-and-reports/custom-test-suite#3.-set-test-conditions

elenasamuylova avatar Jul 26 '23 13:07 elenasamuylova

In this case you will set the Test condition without comparing it to the reference - the Test will simply check if the total share of missing values in the current dataset is over 20%.

It is not currently possible to set a different condition relative to the reference automatically. If you want to set a condition as +/-20% from reference, you need to first derive the share of missing values in your reference dataset, and then use approx (explained here: https://docs.evidentlyai.com/user-guide/tests-and-reports/custom-test-suite#custom-conditions-with-approx). Here is how you set the boundary as 5 +/-20%: eq=approx(5, relative=0.2)
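The two-step workaround above (derive the reference share, then apply a relative tolerance) can be sketched in plain Python. This is only an illustration of the arithmetic, not Evidently code. The helpers `share_of_missing` and `within_relative_tolerance`, the use of `None` as the missing marker, and the sample data are all assumptions made for the example. With Evidently itself, the derived reference share would be passed to the approx helper as described above.

```python
def share_of_missing(values):
    """Return the fraction of entries that are None (the 'missing' marker assumed here)."""
    return sum(1 for v in values if v is None) / len(values)


def within_relative_tolerance(value, expected, relative):
    """Return True if `value` lies within expected ± relative * |expected|."""
    tolerance = abs(expected) * relative
    return expected - tolerance <= value <= expected + tolerance


# Hypothetical datasets with a low share of missing values.
reference = [None] * 5 + [1] * 95    # 5 of 100 missing
current = [None] * 29 + [1] * 471    # 29 of 500 missing

# Step 1: derive the metric from the reference dataset.
ref_share = share_of_missing(reference)
cur_share = share_of_missing(current)

# Step 2: compare the current metric against the reference with a chosen tolerance.
ok_at_10_percent = within_relative_tolerance(cur_share, ref_share, relative=0.1)
ok_at_20_percent = within_relative_tolerance(cur_share, ref_share, relative=0.2)

print(f"reference share={ref_share:.3f}, current share={cur_share:.3f}")
print("within +/-10% of reference:", ok_at_10_percent)  # False: allowed band is 0.045 to 0.055
print("within +/-20% of reference:", ok_at_20_percent)  # True: allowed band is 0.040 to 0.060
```

For a low reference value such as 0.05, the +/-10% band is only 0.01 wide, which is why a small absolute change in the current data can already fall outside it.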

We plan to add the ability to set the condition relative to the reference in the future.

elenasamuylova avatar Jul 26 '23 13:07 elenasamuylova

Great, this answers my question :)

> We plan to add the ability to set the condition relative to the reference in the future.

Looking forward to that moment. Thanks for all the clarification.

anh-le-profinit avatar Jul 26 '23 13:07 anh-le-profinit