TFDV - Slicing data based on date range

Open srinivasaraov opened this issue 5 years ago • 9 comments

In TensorFlow Data Validation, there is a method slicing_util.get_feature_value_slicer() that slices data based on a feature value.

Is it possible to slice the data by date range using this method and then compare the sliced datasets?

Let's say I have 'n' records within the date range t1-t10. If I want to split the data into sets that fall in date ranges such as t1-t3, t4-t6 and t8-t10, is that possible with the above slicing method?

srinivasaraov avatar Jul 09 '20 08:07 srinivasaraov

@srinivasaraov, can you please check the source code of get_feature_value_slicer, along with the description of that function, and let us know if it helps? Thanks!

rmothukuru avatar Jul 09 '20 10:07 rmothukuru

@rmothukuru : I see the following documentation in the source code.

Raises:
  TypeError: If feature values are not specified in an iterable.
  NotImplementedError: If a value of a type other than string or integer is specified in the values iterable in features.

So, I'm assuming specifying a date range is supported. Is that correct?

srinivasaraov avatar Jul 10 '20 05:07 srinivasaraov

What is the type of your date/timestamp feature?

I don't think the default slicer will be able to slice by ranges but you can implement your own slicer. A slicer is just a function that takes a pa.RecordBatch and returns a List[Tuple[Text, pa.RecordBatch]], where the first term in the tuple is the slice key, and the second term is the RecordBatch that contains only rows corresponding to the slice key.

brills avatar Jul 10 '20 16:07 brills

By the way, we are looking into allowing SQL statements for slicing, which may be able to support your use case. However, there's no timeline yet.

brills avatar Jul 10 '20 16:07 brills

Thanks @brills

The type of the date/timestamp feature is DATETIME.

Could you please point me to any example of custom slicer implementation if possible?

srinivasaraov avatar Jul 11 '20 09:07 srinivasaraov

Sorry, which DATETIME type do you mean? I don't think TFDV supports such types (only integral, floating-point and string/bytes).

Our feature value slicer is no different from any other custom slicer, so it can serve as an example: https://github.com/tensorflow/data-validation/blob/a7d783378bd9487f1e8389675967f9e782210312/tensorflow_data_validation/utils/slicing_util.py#L100
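
Given that only integral, floating and string/bytes features are supported, one possible workaround (an assumption here, not something proposed in the thread) is to encode the DATETIME value as integer Unix seconds before the data reaches TFDV, so that a date range becomes a pair of integer bounds. The helper names and the timestamp format below are hypothetical, and the strings are assumed to be in UTC.

```python
from datetime import datetime, timezone
from typing import Tuple

# Hypothetical DATETIME string layout; adjust to match the actual data.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch_seconds(value: str, fmt: str = DATETIME_FORMAT) -> int:
    """Parse a DATETIME-style string as UTC and return integer Unix seconds."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def range_bounds(start: str, end: str) -> Tuple[int, int]:
    """Turn a (start, end) pair of DATETIME strings into integer bounds."""
    return to_epoch_seconds(start), to_epoch_seconds(end)


# Integer bounds for a one-day range, usable in integer comparisons.
bounds = range_bounds("2020-07-09 00:00:00", "2020-07-10 00:00:00")
```

With the feature stored this way, range membership is an ordinary integer comparison inside a custom slicer.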

brills avatar Jul 14 '20 23:07 brills

Did you eventually implement this slicer yourself, @srinivasaraov?

axeltidemann avatar Mar 01 '21 15:03 axeltidemann

@axeltidemann: Not yet. This has been deprioritised for us for now. I will post an update when I implement it.

srinivasaraov avatar Mar 02 '21 05:03 srinivasaraov

Cool, I think it would be a very useful feature.

axeltidemann avatar Mar 02 '21 06:03 axeltidemann