data-validation issues

TFDV - Slicing data based on date range

9

In TensorFlow Data Validation, the `slicing_util.get_feature_value_slicer()` method slices data based on a feature value. Is it possible to slice the data based on a date range using...

srinivasaraov

stat:awaiting tensorflower

type:support

Get formatted schema and anomalies to visualize

5

I'm trying to run a TFDV process in a Kubeflow Pipeline and visualize the results in the pipeline UI. For statistics, I can easily visualize them using `get_statistics_html`. However, for schema and anomalies,...

wakanapo

stat:awaiting tensorflower

type:support

Wrong documentation for validate_instance()

In the [TFDV Get Started](https://www.tensorflow.org/tfx/data_validation/get_started#inferring_a_schema_over_the_data) page, it states that:

> TFDV also provides the `validate_instance` function for identifying whether an individual example exhibits anomalies when matched against a schema. To...

kennysong

type:docs

stat:awaiting tensorflower

type:bug

fix a bad link to SysML paper

wendykan

cla: yes

Solution for skew/drift detection in distribution of numerical feature

5

I opened issue #101 about dealing with numerical features because my company needs ML data quality control. I have made a small workaround suitable for our pipeline,...

wrapper228

cla: yes

tfdv.validate_tensor_examples()?

4

I think it would be nice to have a top-level function to check for anomalies in serving data. It could be integrated into `serving_input_receiver_fn`. It doesn't make sense to have...

schmidt-jake

stat:awaiting tensorflower

type:feature

The generate_statistics_from_csv very slowly for large dataset in single server

4

Hi. Following the TFX examples, I pass `pipeline_options` to `generate_statistics_from_csv`, setting `--direct_num_workers=16` like:

```python
pipeline_options = PipelineOptions(['--direct_num_workers=16'])
```

It seems that this option cannot speed up this...

yajunwong

stat:awaiting tensorflower

type:performance

We can not use INT with missing values?

4

It seems that we can't use INT with missing values. For example, using the schema and the csv below would fail to validate: schema.pbtxt: ``` feature { name: "f1" type:...

sfujiwara

stat:awaiting tensorflower

type:support

GenerateStatistics API Change

Towards the goal of adding support for computing statistics over structured data (e.g., arbitrary protocol buffers, Parquet data), the `GenerateStatistics` API will take Arrow tables as input instead of `Dict[FeatureName, ndarray]`....

paulgc

Announcement

Newline in CSV quoted string breaks reader

5

Hi, it looks like the current CSV reader does not support the case where a quoted string value spans several lines (that is, the value contains line breaks). It means a logical CSV...

jondot

stat:awaiting tensorflower

type:feature
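
To illustrate the case this issue describes, here is a minimal sketch using only Python's standard `csv` module, not TFDV's own reader. The sample data and the variable names (`raw`, `physical_lines`, `logical_rows`) are made up for illustration. It shows how a quoted value containing a line break makes the number of physical lines differ from the number of logical records, which is presumably what trips up a purely line-based reader.

```python
import csv
import io

# Hypothetical sample: the first data record has a quoted field that
# contains a line break, so it occupies two physical lines.
raw = 'id,comment\n1,"first line\nsecond line"\n2,"single line"\n'

# Splitting on line breaks treats every physical line as its own record.
physical_lines = raw.splitlines()

# csv.reader honours the quotes and keeps the embedded newline inside
# a single field, yielding one row per logical record.
logical_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(physical_lines))  # header + 3 physical lines
print(len(logical_rows))    # header + 2 logical records
print(logical_rows[1])      # the multi-line value stays in one field
```

A reader that splits its input on newlines before parsing would see the second and third physical lines as two malformed records rather than one valid record.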
