data-validation
TFDV - Slicing data based on date range
In Tensorflow Data Validation, there is a method slicing_util.get_feature_value_slicer() to slice data based on a feature value.
Is it possible to slice the data based on a date range using the above method and compare the sliced datasets?
Let's say I have 'n' records within the date range t1-t10. If I want to split the data into sets that fall in date ranges such as t1-t3, t4-t6 and t8-t10, is that possible with the above slicing method?
@srinivasaraov, Can you please check the source code of get_feature_value_slicer along with the description of that function, and let us know if it helps. Thanks!
@rmothukuru : I see the following documentation in the source code.
Raises:
TypeError: If feature values are not specified in an iterable.
NotImplementedError: If a value of a type other than string or integer is
specified in the values iterable in features.
So, I'm assuming specifying a date range is supported. Is that correct?
What is the type of your date/timestamp feature?
I don't think the default slicer can slice by ranges, but you can implement your own slicer. A slicer is just a function that takes a pa.RecordBatch and returns a List[Tuple[Text, pa.RecordBatch]], where the first element of each tuple is the slice key and the second element is a RecordBatch containing only the rows that belong to that slice key.
By the way, we are looking into supporting SQL statements for slicing, which may cover your use case. However, there is no timeline yet.
Thanks @brills
Type of date/timestamp is DATETIME.
Could you please point me to any example of custom slicer implementation if possible?
Sorry, which DATETIME type did you mean? I don't think TFDV supports such types (only integral, floating-point and string/bytes).
Our feature value slicer is no different from any other custom slicer: https://github.com/tensorflow/data-validation/blob/a7d783378bd9487f1e8389675967f9e782210312/tensorflow_data_validation/utils/slicing_util.py#L100
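Since only integral, floating-point and string/bytes features are supported, one possible workaround is to encode the datetime as an integer before the data reaches TFDV. The sketch below does this with the Python standard library only. The helper name `to_epoch_seconds`, the input format, and the choice of UTC are assumptions for illustration, not anything TFDV prescribes.

```python
from datetime import datetime, timezone


def to_epoch_seconds(value: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> int:
    """Parses a datetime string and returns whole seconds since the Unix epoch.

    Assumption: the string carries no timezone and is interpreted as UTC.
    """
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# The start of 2020 in UTC, expressed as an integer feature value.
start_of_2020 = to_epoch_seconds("2020-01-01 00:00:00")
```

With the feature stored as an integer, a date range becomes a plain numeric comparison that a custom slicer can evaluate.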
Did you eventually implement this slicer yourself, @srinivasaraov ?
@axeltidemann : Not yet. This was deprioritised for us at the moment. I will update when I implement this.
Cool, I think it would be a very useful feature.