GenerateStatistics API Change

Open paulgc opened this issue 6 years ago • 0 comments

To support computing statistics over structured data (e.g., arbitrary protocol buffers, Parquet data), the GenerateStatistics API will take Arrow tables as input instead of Dict[FeatureName, ndarray]. The API will only accept Arrow tables whose columns are ListArrays of primitive types (e.g., int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, binary, string, unicode).

This change should be a no-op if you construct the pipeline using the default decoders (e.g., tfdv.DecodeTFExample and tfdv.DecodeCSV) or if you are using the utility methods to generate statistics (e.g., tfdv.generate_statistics_from_tfrecord, tfdv.generate_statistics_from_csv and tfdv.generate_statistics_from_dataframe).

TFDV 0.14 will have this new behavior. Let us know if you have any issues with migrating to the new API.

paulgc, Jul 20 '19 02:07