data-validation
Library for exploring and validating machine learning data
In TensorFlow Data Validation, the method `slicing_util.get_feature_value_slicer()` slices data based on a feature value. Is it possible to slice the data based on a date range using...
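To make the date-range idea concrete, here is a plain-Python sketch of the slicing logic only. It does not use TFDV, and the slice-function signature TFDV expects may differ by version. The names `make_date_range_slicer`, `signup_date`, and the sample rows are hypothetical, and dates are assumed to be ISO-formatted strings.

```python
from datetime import date


def make_date_range_slicer(feature_name, start, end):
    """Return a function that maps one example (a dict) to a list of slice keys."""
    slice_key = f"{feature_name}_{start.isoformat()}_to_{end.isoformat()}"

    def slicer(example):
        value = example.get(feature_name)
        if value is None:
            # Examples without the feature belong to no slice.
            return []
        day = date.fromisoformat(value)
        # Both ends of the range are inclusive.
        return [slice_key] if start <= day <= end else []

    return slicer


# Hypothetical sample data with ISO-formatted date strings.
examples = [
    {"signup_date": "2020-01-15", "f1": 1},
    {"signup_date": "2020-03-02", "f1": 2},
    {"f1": 3},
]

january = make_date_range_slicer("signup_date", date(2020, 1, 1), date(2020, 1, 31))
sliced = [(key, ex) for ex in examples for key in january(ex)]
```

Only the first example falls inside the January range, so `sliced` holds a single (slice key, example) pair. The same predicate could be adapted to whatever slice-function interface the installed TFDV version provides.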
I'm trying to run a TFDV process in a Kubeflow Pipeline and visualize the results in the pipeline UI. For statistics, I can easily visualize them using `get_statistics_html`. However, for schema and anomalies,...
In the [TFDV Get Started](https://www.tensorflow.org/tfx/data_validation/get_started#inferring_a_schema_over_the_data) page, it states that: > TFDV also provides the `validate_instance` function for identifying whether an individual example exhibits anomalies when matched against a schema. To...
I opened issue #101 about dealing with numerical features because my company needs ML data quality control. I have made a small workaround suitable to our pipeline,...
I think it would be nice to have a top-level function to check for anomalies in serving data. It could be integrated into `serving_input_receiver_fn`. It doesn't make sense to have...
Hi, according to the TFX examples, I pass `pipeline_options` to `generate_statistics_from_csv`, setting `--direct_num_workers=16` like: ```python pipeline_options = PipelineOptions(['--direct_num_workers=16']) ``` It seems that this option cannot speed up this...
It seems that we can't use INT with missing values. For example, using the schema and the csv below would fail to validate: schema.pbtxt: ``` feature { name: "f1" type:...
Towards the goal of adding support for computing statistics over structured data (e.g., arbitrary protocol buffers, parquet data), the `GenerateStatistics` API will take Arrow tables as input instead of `Dict[FeatureName, ndarray]`....
Hi, it looks like the current CSV reader does not support the case where a quoted string value spans several lines (with line breaks inside the value). It means a logical CSV...
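The difference between physical lines and logical CSV records can be illustrated with Python's standard `csv` module. This is a sketch of the expected parsing behavior only, not of TFDV's CSV reader. The sample data and the helper name `read_logical_records` are hypothetical.

```python
import csv
import io

# Hypothetical sample: the first data record has a quoted value containing a line break.
raw = 'id,comment\n1,"first line\nsecond line"\n2,"single line"\n'


def read_logical_records(text):
    """Parse CSV text, keeping quoted values that contain line breaks intact."""
    # newline='' disables newline translation so the csv module sees
    # the embedded line breaks exactly as written.
    return list(csv.reader(io.StringIO(text, newline='')))


records = read_logical_records(raw)   # header + 2 logical records
physical_lines = raw.splitlines()     # 4 physical lines
```

Splitting on line breaks yields four physical lines, while the quote-aware parser yields three logical records (the header plus two rows). A reader that splits input line by line before parsing would break the first row in two.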