TFDV does not catch out-of-domain values for categorical ints

Open kennysong opened this issue 4 years ago • 9 comments

The domain of a categorical int feature is included in my schema as a string_domain (generated by using feature.int_domain.is_categorical = True).

However, when I try to run tfdv.validate_instance() on an example with an out-of-domain value for a categorical int, TFDV doesn't generate any anomalies.

Here's a Colab to reproduce.

kennysong avatar Apr 08 '21 06:04 kennysong

@kennysong , this might originate from the observation in your comment #153.

arghyaganguly avatar Apr 08 '21 12:04 arghyaganguly

Hi Kenny -- TFDV does not support specifying a set of valid values as strings when you have a feature that is of type INT, even when that feature is marked categorical. That is why you are seeing the DOMAIN_INVALID_FOR_TYPE anomaly for HOUR_APPR_PROCESS_START when you've tried to specify a string domain for that feature.

Instead, in your example, you could specify an int_domain.min of 0 and an int_domain.max of 23 to try to catch unexpected values for the HOUR_APPR_PROCESS_START feature.

caveness avatar Apr 09 '21 14:04 caveness

Thanks for the clarification!

What’s the best way to validate non-sequential categorical ints in the schema? (E.g. 5 digit zip code)

kennysong avatar Apr 09 '21 15:04 kennysong

Right now, we don't have a way to specify a set of valid non-sequential values in the int domain, since all we offer is the min and max. So, for the zip codes case, you could set the int domain min to the minimum valid zip code, and the max to the max valid zip code and catch things like six-digit codes, etc.

Do you have a use case where this is a problem that you can share with us? (If not, we understand.) Knowing where limitations are tripping up users helps us prioritize expanding TFDV functionality.

caveness avatar Apr 09 '21 15:04 caveness

Got it, I'll let you know if we run into a specific use case where we need integer domains as discrete sets vs. intervals.

Right now, it's just that the default behavior was misleading.

If I specify feature.int_domain.is_categorical = True (which I believe is required for the correct stats generation and other downstream processing), a string_domain is automatically generated. Yet this domain is ignored for validation, which was unexpected and inconsistent with other categorical features.

Is the correct procedure to manually remove the string_domain and then set int_domain.min and int_domain.max?

kennysong avatar Apr 10 '21 02:04 kennysong

Hi Kenny -- Thanks for the clarification on the problem. TFDV shouldn't be inferring an invalid schema in this case; this is a bug on our end.

In the meantime, can you try a workflow using update_schema instead of infer_schema when you generate the schema? When you do so, pass the original schema you used to specify that the feature at issue is categorical as schema. This won't give you validation of the values in the int domain (again, all we can do on that for now is have you manually specify a min/max in the int domain), but it should prevent you from getting an invalid schema that automatically throws an anomaly.

caveness avatar Apr 21 '21 20:04 caveness

Understood, thanks! I'll work around the auto-generated string_domain for now.

kennysong avatar Apr 22 '21 08:04 kennysong

@kennysong

Could you please confirm whether this issue can be closed? Thanks.

UsharaniPagadala avatar Nov 12 '21 12:11 UsharaniPagadala

@UsharaniPagadala I think the bug still exists, but there is a workaround for now.

kennysong avatar Nov 12 '21 12:11 kennysong

@kennysong

We have updated the official TFDV documentation here with new data validation functions. Could you please try the new functions instead of tfdv.validate_instance() to find anomalies in discrete integer categorical features, and check whether the issue has been taken care of in the latest version of TFDV? The workaround you suggested in https://github.com/tensorflow/data-validation/issues/131 is working fine; I reproduced the user's code here using the same approach.

If you need any further assistance, please let us know. If your suggested workaround is working fine, could you please close this issue?

Thank you!

gaikwadrahul8 avatar Nov 08 '22 14:11 gaikwadrahul8

Hi @gaikwadrahul8, that workaround you linked is for validating floats, but this issue is about validating categorical integers.

There's still no way to validate categorical integers against a set of values like {123, 345, 569}, instead of inside an integer range like [123, 569].

However, it sounds like this is working as intended and there have been no other user requests, so I'll close this as not planned.

kennysong avatar Nov 10 '22 07:11 kennysong
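
The difference between the two kinds of check can be shown in plain Python, independent of TFDV. This is a minimal sketch: the helper names `in_range` and `in_set` are hypothetical, and the values are the ones used in the comment above.

```python
# Valid categorical values from the example in the thread.
ALLOWED = frozenset({123, 345, 569})


def in_range(value: int) -> bool:
    """Interval check, analogous to what int_domain.min / int_domain.max express."""
    return min(ALLOWED) <= value <= max(ALLOWED)


def in_set(value: int) -> bool:
    """Discrete membership check, which an interval alone cannot express."""
    return value in ALLOWED


for v in (345, 200, 600):
    print(v, in_range(v), in_set(v))
```

A value such as 200 passes the interval check but fails the membership check, which is the gap described in this issue.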