Datetime columns stored with the object pandas dtype break LSTMDetection
Environment Details
SDMetrics version: 0.14.0 (a user reported 0.11 as well)
Error Description
If your pandas DataFrame contains datetime column(s) stored with the object dtype (instead of a datetime dtype), LSTMDetection breaks. This happens because object and datetime fields are transformed and handled differently: object columns are one-hot encoded as if they were categorical, so the error message describes a failed one-hot encoding attempt.
- Relevant sdmetrics code: https://github.com/sdv-dev/SDMetrics/blob/main/sdmetrics/utils.py#L146
- Higher level explanation of how they're processed differently: https://docs.sdv.dev/sdmetrics/metrics/metrics-in-beta/detection-sequential#data-compatibility
Originally raised in https://github.com/sdv-dev/SDMetrics/issues/422 and https://github.com/sdv-dev/SDMetrics/issues/580
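The dtype distinction can be seen directly in pandas: a datetime column has dtype kind 'M', while the same values stored as strings have kind 'O', which is the branch that triggers one-hot encoding. A minimal illustration (not SDMetrics code):

```python
import pandas as pd

# The same dates, stored two ways. SDMetrics' HyperTransformer branches on
# the dtype "kind": 'M' columns are handled as datetimes, 'O' columns are
# one-hot encoded as categoricals.
dates_as_datetime = pd.Series(pd.to_datetime(['2020-01-01', '2020-01-02']))
dates_as_object = pd.Series(['2020-01-01', '2020-01-02'])

print(dates_as_datetime.dtype.kind)  # M -> handled as datetime
print(dates_as_object.dtype.kind)    # O -> one-hot encoded
```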
Workaround
For now, manually cast your datetime columns to the datetime dtype before using LSTMDetection. One quick way is using pandas.to_datetime():
df['date_col_1'] = pd.to_datetime(df['date_col_1'])
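If several columns are affected, the casting can be driven by the metadata itself. A sketch of a helper, assuming SDV-style metadata (a dict with a `'columns'` mapping of column names to `{'sdtype': ...}` entries, optionally carrying a `'datetime_format'`); the function name is illustrative:

```python
import pandas as pd

def cast_datetime_columns(df, metadata):
    """Return a copy of df with every metadata-declared datetime column cast.

    Assumes metadata shaped like {'columns': {name: {'sdtype': ...}}}, as in
    SDV's metadata format. An optional 'datetime_format' entry is passed
    through to pandas.to_datetime.
    """
    df = df.copy()
    for column, info in metadata.get('columns', {}).items():
        if info.get('sdtype') == 'datetime' and column in df:
            df[column] = pd.to_datetime(df[column], format=info.get('datetime_format'))
    return df
```

Run this on both the real and synthetic DataFrames before calling LSTMDetection.compute, since both go through the same transform.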
Steps to reproduce
- GitHub Gist
- Internal Colab Notebook
Ideal Solution
If the user-provided metadata has datetime columns (e.g. "sdtype": "datetime"), we should convert those columns to the datetime dtype.
- If the column can't be cast to datetime but the user claims it can, we should raise a useful error educating them (instead of just bubbling up the pandas error)
Full Stack Trace
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[23], line 3
1 from sdmetrics.timeseries import LSTMDetection
----> 3 LSTMDetection.compute(
4 real_data=df1,
5 synthetic_data=synth_df1,
6 metadata=metadata1,
7 sequence_key=['s_key']
8
9 )
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/timeseries/detection.py:84, in TimeSeriesDetectionMetric.compute(cls, real_data, synthetic_data, metadata, sequence_key)
81 ht.fit(real_data.drop(sequence_key, axis=1))
83 real_x = cls._build_x(real_data, ht, sequence_key)
---> 84 synt_x = cls._build_x(synthetic_data, ht, sequence_key)
86 X = pd.concat([real_x, synt_x])
87 y = pd.Series(np.array([0] * len(real_x) + [1] * len(synt_x)))
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/timeseries/detection.py:42, in TimeSeriesDetectionMetric._build_x(data, hypertransformer, sequence_key)
40 for entity_id, entity_data in data.groupby(sequence_key):
41 entity_data = entity_data.drop(sequence_key, axis=1)
---> 42 entity_data = hypertransformer.transform(entity_data)
43 entity_data = pd.Series({
44 column: entity_data[column].to_numpy()
45 for column in entity_data.columns
46 }, name=entity_id)
48 X = pd.concat([X, pd.DataFrame(entity_data).T], ignore_index=True)
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sdmetrics/utils.py:200, in HyperTransformer.transform(self, data)
197 elif kind == 'O':
198 # Categorical column.
199 col_data = pd.DataFrame({'field': data[field]})
--> 200 out = transform_info['one_hot_encoder'].transform(col_data).toarray()
201 transformed = pd.DataFrame(
202 out, columns=[f'value{i}' for i in range(np.shape(out)[1])])
203 data = data.drop(columns=[field])
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/utils/_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
293 @wraps(f)
294 def wrapped(self, X, *args, **kwargs):
--> 295 data_to_wrap = f(self, X, *args, **kwargs)
296 if isinstance(data_to_wrap, tuple):
297 # only wrap the first output for cross decomposition
298 return_tuple = (
299 _wrap_data_with_container(method, data_to_wrap[0], X, self),
300 *data_to_wrap[1:],
301 )
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:1023, in OneHotEncoder.transform(self, X)
1018 # validation of X happens in _check_X called by _transform
1019 warn_on_unknown = self.drop is not None and self.handle_unknown in {
1020 "ignore",
1021 "infrequent_if_exist",
1022 }
-> 1023 X_int, X_mask = self._transform(
1024 X,
1025 handle_unknown=self.handle_unknown,
1026 force_all_finite="allow-nan",
1027 warn_on_unknown=warn_on_unknown,
1028 )
1030 n_samples, n_features = X_int.shape
1032 if self._drop_idx_after_grouping is not None:
File ~/.pyenv/versions/sdv_latest/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:213, in _BaseEncoder._transform(self, X, handle_unknown, force_all_finite, warn_on_unknown, ignore_category_indices)
208 if handle_unknown == "error":
209 msg = (
210 "Found unknown categories {0} in column {1}"
211 " during transform".format(diff, i)
212 )
--> 213 raise ValueError(msg)
214 else:
215 if warn_on_unknown:
ValueError: Found unknown categories ['1961-05-27', '1909-11-03', '1967-11-28', '1969-08-08', '1918-11-02', '1952-01-24', '1947-12-26', '1981-06-01', '1954-03-04', '1936-11-13'] in column 0 during transform