tfx
Is data_validation specific to a problem type, or generic?
I am looking at TFX data validation with StatisticsGen and SchemaGen. Are these validations specific to a problem type, such as tabular data, NLP data, or image data? I am looking for descriptive statistics, but as far as I can see, TFX data validation loads the data, identifies the numeric and categorical features, and computes min, max, mean, std, unique, top, etc. These are generic statistics, the same kind we compute when doing EDA on tabular data.
@tiru1930, please describe your use case and the custom EDA you might want to have or add in data_validation. Please also provide reproducible code to help us expedite the issue.
Follow the links below for a code-based walkthrough of what TFDV currently offers:
1) natural language statistics: tfdv_nlp
2) image statistics: tfdv_image
3) basic statistics: basic stats for tabular data
@arghyaganguly are these used when I call
stats = tfdv.generate_statistics_from_csv(data_location="./imdb.csv")
Also, how do we configure it to generate NLP features? Like this?
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.statistics import stats_options
options = stats_options.StatsOptions(enable_semantic_domain_stats=True)
stats = tfdv.generate_statistics_from_csv(data_location="./imdb.csv", stats_options=options)
Also, can I convert the results from these APIs to JSON, rather than keeping them as a proto?
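On the JSON question, here is a minimal sketch rather than an official TFDV recipe. It assumes the statistics result is an ordinary protobuf message, so protobuf's own `json_format.MessageToJson` can serialize it. The helper names `stats_proto_to_json` and `numeric_summary` are hypothetical, and the JSON field names (`datasets`, `features`, `numStats`, `stdDev`, `path.step`) are an assumption based on protobuf's default lowerCamelCase mapping of the statistics proto.

```python
import json


def stats_proto_to_json(stats_proto):
    """Serialize a statistics proto (any protobuf message) to a JSON string."""
    # Imported lazily so numeric_summary() stays usable without protobuf installed.
    from google.protobuf import json_format
    return json_format.MessageToJson(stats_proto)


def numeric_summary(stats_json):
    """Collect mean/stdDev/min/max per numeric feature from the JSON form.

    Assumes protobuf's default lowerCamelCase JSON names (numStats, stdDev).
    """
    summary = {}
    for dataset in json.loads(stats_json).get("datasets", []):
        for feature in dataset.get("features", []):
            num_stats = feature.get("numStats")
            if num_stats is None:
                continue  # not a numeric feature (string, bytes, struct)
            # A feature may be identified by path.step or by a plain name.
            steps = feature.get("path", {}).get("step", [])
            name = ".".join(steps) if steps else feature.get("name", "")
            # Zero-valued fields can be omitted from the JSON, hence the 0.0 default.
            summary[name] = {
                key: num_stats.get(key, 0.0)
                for key in ("mean", "stdDev", "min", "max")
            }
    return summary
```

With the `stats` object from the call above, usage would look like `numeric_summary(stats_proto_to_json(stats))`. If a dict is more convenient than a string, `json_format.MessageToDict` is the analogous call.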
@arghyaganguly @nikelite @zhitaoli, I have not found any reference showing how to generate stats on unstructured data, i.e. NLP and image data. Can you please point me to any references I can use?
@davidzats-eng is working on additional support for this, so we hope to bring more functionality soon. Please stay tuned; we'll circle back to this thread once we have that functionality and its documentation.
This issue is one of the few pieces of information I could find on the internet about using TFDV to validate NLP-related data. @davidzats-eng, have you gotten around to creating some examples of how to use these extra features of TFDV?
Specifically, I would like to analyze a dataset containing preprocessed sentences that must be mapped to related tags. Is this something that can be done with these features, or should other paths be taken to validate such datasets?
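For reference, the kind of check I have in mind can be sketched in plain Python, independent of TFDV. This is only an illustration under assumptions: each example is a `(tokens, tags)` pair with one tag expected per token, and the function name `describe_tagged_sentences` is hypothetical.

```python
from collections import Counter
from statistics import mean


def describe_tagged_sentences(examples):
    """Summarize a dataset of (tokens, tags) pairs, each a list of strings."""
    lengths = []
    tag_counts = Counter()
    misaligned = []  # indices of examples whose token and tag counts differ
    for index, (tokens, tags) in enumerate(examples):
        lengths.append(len(tokens))
        tag_counts.update(tags)
        # Assumption: exactly one tag per token.
        if len(tokens) != len(tags):
            misaligned.append(index)
    return {
        "num_examples": len(lengths),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "mean_length": mean(lengths) if lengths else 0.0,
        "tag_counts": dict(tag_counts),
        "misaligned": misaligned,
    }
```

Running this on the training and evaluation splits separately and comparing the tag distributions and sentence lengths would give a rough drift check, which is the sort of validation I would hope to get from TFDV directly.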