zenml icon indicating copy to clipboard operation
zenml copied to clipboard

Ignore columns for drift detection

Open SangamSwadiK opened this issue 2 years ago • 28 comments

Describe changes

Added ignored_columns to Evidently StandardStep to filter dataframe columns, to enable drift detection on selected columns.

  • This fixes the (Issue : #600 )

Pre-requisites

Please ensure you have done the following:

  • [x] I have read the CONTRIBUTING.md document.
  • [x] If my change requires a change to docs, I have updated the documentation accordingly.
  • [ ] If I have added an integration, I have updated the integrations table.
  • [x] I have added tests to cover my changes.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Other (add details above)

@strickvl Hi, Submitting another PR, on the same issue.

SangamSwadiK avatar Jun 14 '22 16:06 SangamSwadiK

Thank you for this. It would be useful to have some tests for this addition so that we know it's doing what it's supposed to do. Would you be able to add those?

strickvl avatar Jun 14 '22 19:06 strickvl

Thank you for this. It would be useful to have some tests for this addition so that we know it's doing what it's supposed to do. Would you be able to add those?

Hi, I would like to add some tests, but I'm not sure where to do it. I did go through the tests/integration/test_examples.py and tests/integration/example_validation.py and the below in tests/integration/test_examples.py

    ExampleIntegrationTestConfiguration(
        name="evidently_drift_detection",
        validation_function=drift_detection_example_validation,
    )

If you can please point me in the right direction, that would help me be able to add the tests. Thanks.

SangamSwadiK avatar Jun 15 '22 00:06 SangamSwadiK

I think at a minimum we'd want some unit tests to ensure that the ignored columns were really being ignored. I would maybe ignore the actual 'integration' tests (of the integration) since we are currently reworking them.

(Flagging @AlexejPenner so we remember to add testing for this as part of the new testing suite.)

strickvl avatar Jun 15 '22 08:06 strickvl

I think at a minimum we'd want some unit tests to ensure that the ignored columns were really being ignored. I would maybe ignore the actual 'integration' tests (of the integration) since we are currently reworking them.

(Flagging @AlexejPenner so we remember to add testing for this as part of the new testing suite.)

Got it. Thanks for clarification. I'll write the unit tests at zenml/tests/unit/integrations/evidently and resubmit again.

SangamSwadiK avatar Jun 15 '22 08:06 SangamSwadiK

@strickvl Hi, I have a noob question. I've written the tests, but I'm facing a type issue and getting a subclass error, its expecting the ignored_columns argument to be of type class, Optional[List[str]] gives a subclass error, however List[str] gets passed.

The below works

    def entrypoint(  # type: ignore[override]
        self,
        reference_dataset: pd.DataFrame,
        comparison_dataset: pd.DataFrame,
        config: EvidentlyProfileConfig,
        ignored_columns:List[str],

But changing List[str] to Optional[List[str]] (which is correct) results in typing.Union type which gives a subclass error : arg1 must be of type class. for the Basestep configuration check.

            if issubclass(arg_type, BaseStepConfig):
                # Raise an error if we already found a config in the signature
                if cls.CONFIG_CLASS is not None:
                    raise StepInterfaceError(
                        f"Found multiple configuration arguments "
                        f"('{cls.CONFIG_PARAMETER_NAME}' and '{arg}') when "
                        f"trying to create step '{name}'. Please make sure to "
                        f"only have one `BaseStepConfig` subclass as input "
                        f"argument for a step."
                    )

I tried using dataclasses, but that too gave a typing.Union. Any suggestions will be helpful.

Thanks.

SangamSwadiK avatar Jun 21 '22 06:06 SangamSwadiK

Hi, sorry for the delay in getting back to you on this. I think you could try changing ignored_columns:List[str] and making it ignored_columns: List[str] = None. Not sure if this is going to fix it, though...

strickvl avatar Jun 24 '22 14:06 strickvl

@strickvl Hi, I've updated the tests and modified the ProfileStep. Please let me know if I can improve or if any changes is needed.

SangamSwadiK avatar Jun 29 '22 16:06 SangamSwadiK

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 4 committers have signed the CLA.

:white_check_mark: SangamSwadiK
:white_check_mark: strickvl
:white_check_mark: htahir1
:x: Sangam


Sangam seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Jul 03 '22 07:07 CLAassistant

I've fixed the darglint errors for raising value errors and removed unused imports.

SangamSwadiK avatar Jul 03 '22 08:07 SangamSwadiK

@SangamSwadiK seems like you now used an email that hasnt signed the CLA yet. Would you mind doing it?

htahir1 avatar Jul 04 '22 11:07 htahir1

@SangamSwadiK seems like you now used an email that hasnt signed the CLA yet. Would you mind doing it?

Hi, I mistakenly used a different username and a email id with a typo which doesnt exist and committed from that 😅. I forgot to check when committing, Can you please tell me if there is anyway to undo those changes ?. I haven't committed under any other email id.

SangamSwadiK avatar Jul 04 '22 12:07 SangamSwadiK

Hi, sorry for the delay in getting back to you on this. I think you could try changing ignored_columns:List[str] and making it ignored_columns: List[str] = None. Not sure if this is going to fix it, though...

Can't do None as default argument, because mypy is throwing error, I guess default can be just an empty list []. src/zenml/integrations/evidently/steps/evidently_profile.py:130: error: Incompatible default for argument "ignored_columns" (default has type "None", argument has type "List[str]

SangamSwadiK avatar Jul 04 '22 13:07 SangamSwadiK

Can't do None as default argument, because mypy is throwing error, I guess default can be just an empty list []. src/zenml/integrations/evidently/steps/evidently_profile.py:130: error: Incompatible default for argument "ignored_columns" (default has type "None", argument has type "List[str]

Yes I think an empty list as the default argument would make sense. Nothing would be dropped in that case.

strickvl avatar Jul 04 '22 14:07 strickvl

I haven't taken a look at the entire PR, but just quickly glanced over the comments and saw the discussion about an empty list as a default argument to a function. In general mutable default values (e.g. an emtpy list [] or an emtpy dict {}) as function arguments should be avoided in python, and I'm pretty sure mypy will also complain about that.

The reason for this is that it often leads to unwanted behaviour, like for example this:

def append(value, list_: List = []):
    list_.append(value)
    return list_
    
# the first time this works as expected, appending the value 3 to an empty list
print(append(3))  # [3]

# the second time, python uses the same instance of the list which already 
# contains elements from the previous function call
print(append(4))  # [3, 4]

I think there are two main ways to avoid this:

  • Use an emtpy tuple (immutable) as default value:
    def f(value: Tuple = ()):
        ...
    
  • Make the argument Optional and set the default value inside the function:
    def f(value: Optional[List] = None):
        value = value or []
        ...
    

schustmi avatar Jul 08 '22 11:07 schustmi

I haven't taken a look at the entire PR, but just quickly glanced over the comments and saw the discussion about an empty list as a default argument to a function. In general mutable default values (e.g. an emtpy list [] or an emtpy dict {}) as function arguments should be avoided in python, and I'm pretty sure mypy will also complain about that.

The reason for this is that it often leads to unwanted behaviour, like for example this:

def append(value, list_: List = []):
    list_.append(value)
    return list_
    
# the first time this works as expected, appending the value 3 to an empty list
print(append(3))  # [3]

# the second time, python uses the same instance of the list which already 
# contains elements from the previous function call
print(append(4))  # [3, 4]

I think there are two main ways to avoid this:

  • Use an emtpy tuple (immutable) as default value:
    def f(value: Tuple = ()):
        ...
    
  • Make the argument Optional and set the default value inside the function:
    def f(value: Optional[List] = None):
        value = value or []
        ...
    

@schustmi Thanks a lot for feedback and reminding ! I remember reading a warning for using mutable objects in function calls in python official docs, since it references the same object !. I guess we can go with an empty tuple as a default value since I can specifically mention tuple elements to be of string type. What do you suggest ?

SangamSwadiK avatar Jul 08 '22 11:07 SangamSwadiK

@SangamSwadiK As far as I can understand from code, we'll want an if condition that checks if the ignore_columns are emtpy/None anyway right? In that case I'd personally go with the Optional[List[str]], but both should be fine.

def entrypoint(..., ignored_columns: Optional[List[str]] = None):
    if ignored_columns:
        # In here you can treat it as a List[str] with elements in it, and mypy should also
        # be aware of that automatically due to the if-condition above
        ...

schustmi avatar Jul 08 '22 11:07 schustmi

@SangamSwadiK As far as I can understand from code, we'll want an if condition that checks if the ignore_columns are emtpy/None anyway right? In that case I'd personally go with the Optional[List[str]], but both should be fine.

def entrypoint(..., ignored_columns: Optional[List[str]] = None):
    if ignored_columns:
        # In here you can treat it as a List[str] with elements in it, and mypy should also
        # be aware of that automatically due to the if-condition above
        ...

I've tried Optional[List[str]] but it gives a subclass error because it results in typing.Union. The error comes from Basestep configuration check.

            if issubclass(arg_type, BaseStepConfig):
                # Raise an error if we already found a config in the signature
                if cls.CONFIG_CLASS is not None:
                    raise StepInterfaceError(
                        f"Found multiple configuration arguments "
                        f"('{cls.CONFIG_PARAMETER_NAME}' and '{arg}') when "
                        f"trying to create step '{name}'. Please make sure to "
                        f"only have one `BaseStepConfig` subclass as input "
                        f"argument for a step."
                    )

Error message: subclass error : arg1 must be of type class,

SangamSwadiK avatar Jul 08 '22 12:07 SangamSwadiK

@strickvl Hi, I have a noob question. I've written the tests, but I'm facing a type issue and getting a subclass error, its expecting the ignored_columns argument to be of type class, Optional[List[str]] gives a subclass error, however List[str] gets passed.

The below works

    def entrypoint(  # type: ignore[override]
        self,
        reference_dataset: pd.DataFrame,
        comparison_dataset: pd.DataFrame,
        config: EvidentlyProfileConfig,
        ignored_columns:List[str],

But changing List[str] to Optional[List[str]] (which is correct) results in typing.Union type which gives a subclass error : arg1 must be of type class. for the Basestep configuration check.

            if issubclass(arg_type, BaseStepConfig):
                # Raise an error if we already found a config in the signature
                if cls.CONFIG_CLASS is not None:
                    raise StepInterfaceError(
                        f"Found multiple configuration arguments "
                        f"('{cls.CONFIG_PARAMETER_NAME}' and '{arg}') when "
                        f"trying to create step '{name}'. Please make sure to "
                        f"only have one `BaseStepConfig` subclass as input "
                        f"argument for a step."
                    )

I tried using dataclasses, but that too gave a typing.Union. Any suggestions will be helpful.

Thanks.

@schustmi Because of this, Hence decided to move ahead with List[str] previously.

SangamSwadiK avatar Jul 08 '22 12:07 SangamSwadiK

Ah sorry my bad, I didn't read the previous comments. I now see that this is part of a ZenML step, which unfortunately doesn't allow Optional inputs which explains the error. Additionally, it right now doesn't work with default values (definitely something we want to tackle in the future to make these steps more flexible).

We could put the ignore_columns in the EvidentlyProfileConfig for that step instead, which would allow both Optional and default values. However I'm not familiar enough with this integration to know if this is something that makes sense or is required on the step, maybe @strickvl has an opinion on that.

schustmi avatar Jul 08 '22 14:07 schustmi

@schustmi @SangamSwadiK I think it's reasonable to add the columns to the EvidentlyProfileConfig object. I don't see any reasonable why that wouldn't be work. But it will change a few things to the current implementation. (And the tests would need to change as well). But then once implemented that way you can get the values by calling config.ignored_columns or something to that equivalent.

strickvl avatar Jul 11 '22 19:07 strickvl

@schustmi @SangamSwadiK I think it's reasonable to add the columns to the EvidentlyProfileConfig object. I don't see any reasonable why that wouldn't be work. But it will change a few things to the current implementation. (And the tests would need to change as well). But then once implemented that way you can get the values by calling config.ignored_columns or something to that equivalent.

I completely support this idea. The columns are definitely part of the step configuration instead of being step artifacts. The difference is this:

  • use a step artifact as input for your step if you need information to be dynamically generated from another step in a pipeline. In other words, you would need another step in your pipeline to dynamically generate the list of columns to be used as input by the Evidently profiler step.
  • use a step configuration as input for your step if you know what the configuration is before you run the pipeline.

This definitely feels like a configuration attribute, not like a pipeline artifact.

stefannica avatar Jul 12 '22 16:07 stefannica

Thanks for the clear explanation. I'll try with EvidentlyProfileConfig and update here.

SangamSwadiK avatar Jul 15 '22 01:07 SangamSwadiK

As a heads-up, we've updated Evidently to the last version in the last release, so you'll have to rebase your PR and test with the latest changes again.

stefannica avatar Jul 27 '22 13:07 stefannica

As a heads-up, we've updated Evidently to the last version in the last release, so you'll have to rebase your PR and test with the latest changes again.

Thanks for the heads-up, I encountered a bug in older version of evidently used in ZenML previously for handling dtype object in drift. Need to check with the updated one too.

SangamSwadiK avatar Jul 27 '22 16:07 SangamSwadiK

Updated and made column mapping as part of evidently config also updated the tests.

SangamSwadiK avatar Aug 09 '22 06:08 SangamSwadiK

@strickvl Hi, I've Incorporated all the changes and modified tests.

SangamSwadiK avatar Aug 10 '22 06:08 SangamSwadiK

One other point... in order for the tests to run on our Github CI system, you'll need to add the Evidently integration to be installed for unit-tests.

You can do this in .github/workflows/unit-test.yml. You'll find a line that reads:

zenml integration install kubeflow s3 gcp azure gcp_secrets_manager vertex vault pillow -f

You just have to add evidently to that list so it becomes:

zenml integration install kubeflow s3 gcp azure gcp_secrets_manager vertex vault pillow evidently -f

And then (after you commit and push it) the tests will be able to run!

Thanks!

strickvl avatar Aug 11 '22 08:08 strickvl

@strickvl Hi, I've incorporated the changes, also modified the tests a bit.

SangamSwadiK avatar Aug 11 '22 14:08 SangamSwadiK

@SangamSwadiK Seems like all the tests are failing currently.

strickvl avatar Aug 12 '22 09:08 strickvl

@SangamSwadiK Seems like all the tests are failing currently.

Hi, the errors are related to the data validators, It asks to install a data validator component, can you please suggest me how can I go about using it ?

Thanks.

SangamSwadiK avatar Aug 12 '22 09:08 SangamSwadiK