data-validation
Using tfdv to validate text-based data
Hi,
I have been searching online to find out whether tfdv can be used to validate data that contains text, for instance a dataset of sentences that have to be mapped to labels. I could not find any really useful tutorials: the ones I did find only cover numerical features of a dataset, such as height and weight.
After looking around in the data-validation package, I found two files that seem to be related to this: https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_stats_generator.py and https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_domain_inferring_stats_generator.py
Furthermore, in the TensorFlow documentation for the StatsOptions class I found the following: https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions
Arguments | Description |
---|---|
enable_semantic_domain_stats | If True statistics for semantic domains are generated (e.g: image, text domains). |
semantic_domain_stats_sample_rate | An optional sampling rate for semantic domain statistics. If specified, semantic domain statistics is computed over a sample. |
vocab_paths | An optional dictionary mapping vocab names to paths. Used in the schema when specifying a NaturalLanguageDomain. The paths can either be to GZIP-compressed TF record files that have a tfrecord.gz suffix or to text files. |
These arguments and files suggest that tfdv can be used to analyze and validate data for NLP / text classification problems.
However, it is unclear to me how one would go about using these features to validate text-based data.
I have enabled the enable_semantic_domain_stats argument, and this does give information such as sequence length.
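For reference, this is roughly how I enabled it. It is a minimal sketch rather than a recommended setup: the helper names semantic_stats_kwargs and compute_text_stats are my own, and I am assuming the text lives in a pandas DataFrame. The only tfdv calls used are tfdv.StatsOptions and tfdv.generate_statistics_from_dataframe.

```python
from typing import Optional


def semantic_stats_kwargs(sample_rate: Optional[float] = None) -> dict:
    """Build keyword arguments for tfdv.StatsOptions (hypothetical helper)."""
    kwargs = {"enable_semantic_domain_stats": True}
    if sample_rate is not None:
        # Optional: compute semantic domain statistics over a sample only.
        kwargs["semantic_domain_stats_sample_rate"] = sample_rate
    return kwargs


def compute_text_stats(dataframe, sample_rate: Optional[float] = None):
    """Generate statistics for a pandas DataFrame with semantic stats enabled."""
    # Imported here so the helper above can be used without tfdv installed.
    import tensorflow_data_validation as tfdv

    options = tfdv.StatsOptions(**semantic_stats_kwargs(sample_rate))
    return tfdv.generate_statistics_from_dataframe(
        dataframe, stats_options=options
    )
```

Calling compute_text_stats(df) on a DataFrame with a sentence column and a label column is what produced the sequence length information mentioned above.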
However, how would one extend this to validate vocabularies, for example by checking the ratio of known to unknown words?
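To make the question concrete, the kind of check I have in mind can be sketched in plain Python. This does not use tfdv at all: the function names, the lowercase whitespace tokenization, and the 0.1 threshold are all assumptions for illustration. What I would like to know is whether tfdv can express something equivalent through the schema and vocab_paths.

```python
from typing import Iterable, Set, Tuple


def unknown_token_ratio(sentences: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of tokens that do not appear in vocab.

    Tokens are lowercased and split on whitespace (an assumption).
    """
    total = 0
    unknown = 0
    for sentence in sentences:
        for token in sentence.lower().split():
            total += 1
            if token not in vocab:
                unknown += 1
    # An empty dataset has no unknown tokens by convention.
    return unknown / total if total else 0.0


def check_unknown_ratio(
    sentences: Iterable[str], vocab: Set[str], max_ratio: float = 0.1
) -> Tuple[bool, float]:
    """Return (passed, ratio), where passed means ratio <= max_ratio."""
    ratio = unknown_token_ratio(sentences, vocab)
    return ratio <= max_ratio, ratio
```

For example, with the vocabulary {"the", "cat", "dog"} and the sentences "the cat sat" and "the dog barked", two of the six tokens are unknown, so the check would fail at the default threshold.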
Any tips or thoughts are highly appreciated! Kind Regards, Caspar