
Is data_validation specific to a problem type, or generic?

Open tiru1930 opened this issue 4 years ago • 5 comments

I am looking at the TFX data validation components, StatisticsGen and SchemaGen. Are these validations specific to a problem type, such as tabular data, NLP data, or image data? I am looking for descriptive statistics, but as far as I can see, TFX data validation loads the data, identifies the numeric and categorical features, and computes min, max, mean, std, unique counts, top values, etc. These are generic statistics, the same kind we compute during EDA for tabular data.

tiru1930 avatar Feb 17 '21 10:02 tiru1930

@tiru1930, please describe your use case and the custom EDA you might want to have or add in data_validation. Also, please provide reproducible code to help us expedite the issue.

Follow the links below for a code-based walk-through of what TFDV currently offers:

1) natural language statistics: tfdv_nlp
2) image statistics: tfdv_image
3) basic statistics: basic stats for tabular data

arghyaganguly avatar Feb 17 '21 12:02 arghyaganguly

@arghyaganguly is this used when I call

stats = tfdv.generate_statistics_from_csv(data_location="./imdb.csv")

Actually, how do we set it up to generate NLP features?

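import tensorflow_data_validation as tfdv
from tensorflow_data_validation.statistics import stats_options

# Turn on the semantic-domain statistics option before generating stats.
options = stats_options.StatsOptions(enable_semantic_domain_stats=True)
stats = tfdv.generate_statistics_from_csv(data_location="./imdb.csv", stats_options=options)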

Also, can I convert the results from these APIs to JSON, rather than keeping them as a proto?

tiru1930 avatar Feb 17 '21 13:02 tiru1930
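
On the JSON question above: the statistics returned by `tfdv.generate_statistics_from_csv` are a protocol buffer message, so one option is protobuf's own `json_format` module. A minimal sketch, assuming the `protobuf` package is installed (it is a dependency of TFDV); the helper names `stats_to_json` and `stats_to_dict` are made up for illustration and are not part of TFDV:

```python
def stats_to_json(stats_proto):
    """Serialize a protobuf message (e.g. TFDV statistics) to a JSON string."""
    # Imported inside the function so protobuf is only needed when it is called.
    from google.protobuf import json_format

    return json_format.MessageToJson(stats_proto)


def stats_to_dict(stats_proto):
    """Convert a protobuf message to a plain Python dict."""
    from google.protobuf import json_format

    return json_format.MessageToDict(stats_proto)


# Hypothetical usage with the statistics from the snippet above:
#   stats = tfdv.generate_statistics_from_csv(data_location="./imdb.csv")
#   print(stats_to_json(stats))
```

Both helpers accept any protobuf message, so the same approach should also work for an inferred schema proto.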

@arghyaganguly @nikelite @zhitaoli, I have not found any reference on how to generate stats for unstructured data, i.e. NLP and image data. Can you please point me to any references I can use?

tiru1930 avatar Feb 22 '21 11:02 tiru1930

@arghyaganguly @nikelite @zhitaoli, I have not found any reference on how to generate stats for unstructured data, i.e. NLP and image data. Can you please point me to any references I can use?

@davidzats-eng is working on some additional support for this, so we hope to bring more functionality soon. Please stay tuned, and we'll circle back to this thread once we have that functionality and documentation.

zhitaoli avatar Feb 22 '21 16:02 zhitaoli

This issue is one of the few pieces of information I could find on the internet about using tfdv to validate NLP-related data. @davidzats-eng, have you gotten around to creating some examples of how to use these extra features of tfdv?

Specifically, I would like to analyze a dataset containing preprocessed sentences that must be mapped to related tags. Is this something that can be done with these features, or should other paths be taken to validate such datasets?

Capsar avatar Jun 07 '22 21:06 Capsar