PAR cannot handle a sequence of length 1
Error Description
There is a bug in the way sdv.timeseries handles time series of length 1.
Steps to reproduce
The code below reproduces the error:
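import pandas as pd
from dateutil.parser import parse
from sdv.timeseries import PAR

# A single-row table, i.e. one sequence of length 1.
# Values are wrapped in lists because pandas rejects a dict of only scalars.
data = pd.DataFrame({
    "date": [parse("2020-01-20")],
    "Close": [128.11],
    "exchange": ["NYSE"],
})

model = PAR(
    context_columns=["exchange"],  # passed as a list of column names
    sequence_index="date",
)
model.fit(data)  # the reported failure occurs here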
This looks more like an invalid input problem (if there is only one data point, there is no sequence). We should confirm that this is still a problem and then add a validation with a user-friendly message that rejects the data as invalid.
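For illustration, the validation suggested above could look roughly like the sketch below. The helper name `check_sequence_lengths`, its parameters, and the error wording are hypothetical and not part of SDV or DeepEcho; the check itself uses only pandas.

```python
import pandas as pd


def check_sequence_lengths(data, entity_columns, min_length=2):
    """Raise a readable error if any sequence has fewer than min_length rows.

    Hypothetical helper for illustration only; not an SDV API.
    """
    if entity_columns:
        # One sequence per unique combination of entity column values.
        lengths = data.groupby(entity_columns).size()
    else:
        # Assumption: without entity columns the whole table is one sequence.
        lengths = pd.Series([len(data)])

    too_short = lengths[lengths < min_length]
    if not too_short.empty:
        raise ValueError(
            f"Found {len(too_short)} sequence(s) with fewer than "
            f"{min_length} rows. Remove them or model them separately."
        )


data = pd.DataFrame({
    "subject_id": ["a", "a", "b"],
    "time": pd.to_datetime(["2020-01-20", "2020-01-21", "2020-01-20"]),
    "Close": [128.11, 129.30, 45.02],
})

message = None
try:
    check_sequence_lengths(data, entity_columns=["subject_id"])
except ValueError as error:
    message = str(error)
print(message)
```

Here subject "b" has a single row, so the check raises before any model fitting would start.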
I can confirm that this issue still persists.
SDV version: 0.15.0
DeepEcho version: 0.3.0.post1
There can be many valid reasons why there is only 1 item in a sequence, especially when you're recording event streams. One example: if you are recording transactions made by different users, it's not within your control how many transactions each user makes. You may even find that a majority of sequences have length 1.
It would be more user-friendly if we could accommodate such datasets without throwing an error (either in SDV or directly within the DeepEcho model).
Stack Trace
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-35-dcea4bada6e7> in <module>()
2
3 model = PAR(entity_columns=['subject_id'], sequence_index='time')
----> 4 model.fit(data)
5 model.sample(num_sequences=2)
/usr/local/lib/python3.7/dist-packages/sdv/timeseries/base.py in fit(self, timeseries_data)
208
209 LOGGER.debug('Fitting %s model to table %s', self.__class__.__name__, self._metadata.name)
--> 210 self._fit(transformed)
211
212 def get_metadata(self):
/usr/local/lib/python3.7/dist-packages/sdv/timeseries/deepecho.py in _fit(self, timeseries_data)
85
86 # Validate and fit
---> 87 self._model.fit_sequences(sequences, context_types, data_types)
88
89 def _sample(self, context=None, sequence_length=None):
/usr/local/lib/python3.7/dist-packages/deepecho/models/par.py in fit_sequences(self, sequences, context_types, data_types)
316 self._build(sequences, context_types, data_types)
317 for sequence in sequences:
--> 318 X.append(self._data_to_tensor(sequence['data']))
319 C.append(self._context_to_tensor(sequence['context']))
320
/usr/local/lib/python3.7/dist-packages/deepecho/models/par.py in _data_to_tensor(self, data)
211 elif props['type'] in ['continuous', 'timestamp']:
212 mu_idx, sigma_idx, missing_idx = props['indices']
--> 213 if pd.isnull(data[key][i]) or props['std'] == 0:
214 x[mu_idx] = 0.0
215 else:
IndexError: list index out of range
Workaround
A manual workaround for now would be to split the table into two:
(1) A table containing the sequences of length 2 or more
(2) A table containing only the sequences of length 1
You can model table (1) using the PAR model. If you require sequences of length 1, then you can model table (2) using any of the existing single table models in the SDV (GaussianCopula, CTGAN, etc.)
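As a rough sketch of that split, assuming each sequence is identified by a single entity column (called `subject_id` here, with made-up example data), plain pandas is enough. Fitting the two models is left out.

```python
import pandas as pd

# Made-up example: subjects "a" and "c" have several rows, "b" has one.
data = pd.DataFrame({
    "subject_id": ["a", "a", "b", "c", "c", "c"],
    "time": pd.to_datetime([
        "2020-01-20", "2020-01-21", "2020-01-20",
        "2020-01-20", "2020-01-21", "2020-01-22",
    ]),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# For every row, the number of rows in the sequence it belongs to.
rows_per_sequence = data["subject_id"].map(data["subject_id"].value_counts())

# Table (1): sequences of length 2 or more, to be modeled with PAR.
long_sequences = data[rows_per_sequence >= 2]

# Table (2): sequences of length 1, to be modeled with a single table model.
single_rows = data[rows_per_sequence == 1]

print(len(long_sequences), len(single_rows))
```

If sequences are identified by several entity columns, the same idea applies with a `groupby` over those columns instead of `value_counts` on one.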
This error still happens with data that contains a sequence of length 1.
Thank you for confirming @Ng-ms. We generally will keep a bug open until we work on it and give you a fix.
You can keep an eye on this issue to see when the status changes. Thanks.
Just FYI: I just tried it out with my data set and it appears that this problem is already fixed and can be closed - at least when there are multiple sequences with length 1 in the data set.
Thanks @MarcJohler -- I'm marking this one as under discussion while we check it out with some of the demo datasets. Thanks.