Support for manual numerical distribution constraints in schema/anomalies
I've been looking through the detectable anomalies, and I don't think there's a way to do what I want: enforce a distribution constraint on a feature that isn't related to training/serving skew.
For a concrete example, let's say I have a regular retraining pipeline, and I want to enforce that my examples have a roughly 50/50 distribution of a boolean feature. We have a component that will fail our training pipeline if anomalies are detected, but as far as I'm aware, there's no anomaly I can configure to enforce this distribution. The only option I see is to use the skew feature and create a "golden" set of statistics to compare against, which seems like a roundabout way of doing it.
To generalize: since I can already manually enforce things like min/max value, it would be useful to also be able to enforce things like a feature's mean or standard deviation, or some similar way of thresholding a difference in numeric distribution.
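A minimal sketch of the kind of single-dataset check being requested, written in plain Python rather than against any TFDV API. The function names `check_boolean_balance` and `check_mean_std`, and the tolerance values, are hypothetical and only illustrate the idea:

```python
import statistics
from typing import Iterable, Sequence, Tuple


def check_boolean_balance(
    values: Iterable[bool], target: float = 0.5, tolerance: float = 0.05
) -> bool:
    """Return True if the fraction of True values is within `tolerance` of `target`."""
    values = list(values)
    if not values:
        raise ValueError("no examples to validate")
    fraction_true = sum(1 for v in values if v) / len(values)
    return abs(fraction_true - target) <= tolerance


def check_mean_std(
    values: Sequence[float],
    mean_range: Tuple[float, float],
    std_range: Tuple[float, float],
) -> bool:
    """Return True if the mean and population std deviation fall in the given ranges."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return (
        mean_range[0] <= mean <= mean_range[1]
        and std_range[0] <= std <= std_range[1]
    )
```

In a pipeline, a `False` result from either check would play the same role as a detected anomaly and fail the training run, with no "golden" statistics required.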
Hi @rclough,
Please go through these links 1 and 2 for thresholding the difference in numeric distribution using jensen_shannon_divergence and L-infinity distance. Let us know if this helps. Thank you!
Note that using Jensen Shannon and L-infinity will cover only comparisons between two datasets and will not handle single-dataset validation (unless you were to use a golden dataset as noted in the original issue).
TFDV does not currently have a good way to do the type of single-dataset distribution validation noted in the issue, but we are aiming to expand our anomaly detection functionality and will take this feature request into account.
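For reference, a rough sketch of what the two comparison measures compute for a categorical feature, in plain Python. This is not TFDV's implementation; the helper names and the choice of log base 2 are assumptions made for illustration. It also shows why a second distribution (here a hand-written "golden" one) is always needed:

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable

Distribution = Dict[Hashable, float]


def normalize(values: Iterable[Hashable]) -> Distribution:
    """Turn raw feature values into a probability distribution over distinct values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def l_infinity_distance(p: Distribution, q: Distribution) -> float:
    """Largest absolute difference in probability over any single value."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def jensen_shannon_divergence(p: Distribution, q: Distribution) -> float:
    """JS divergence with log base 2, so the result lies in [0, 1]."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: Distribution) -> float:
        # KL(a || m); terms with zero probability contribute nothing.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hand-written "golden" distribution standing in for a second dataset.
golden = {True: 0.5, False: 0.5}
observed = normalize([True] * 9 + [False] * 1)
distance = l_infinity_distance(observed, golden)
```

Both functions take two distributions as input, which is the limitation described above: without a reference dataset or a manually specified target, there is nothing to compare against.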
@singhniraj08 Thanks. As I noted in the original ticket, and as @caveness clarifies, jensen_shannon_divergence and L-infinity distance only apply when there is a pre-existing example dataset to compare against. So while it is possible to use them as a workaround, it's really not ideal when you just want to manually enforce a specific distribution on a feature.
@caveness, Thank you for the clarification. @rclough, I will make this a feature request as commented by @caveness. Thank you for reporting this issue.
@rclough - TFDV recently added support for custom data validation using SQL. You should be able to use this functionality to do checks like the one described above. Please re-open this issue if you aren't able to do what you need with custom validation.
Please note that custom data validation is not supported on Windows.