
PAR cannot handle a sequence of length 1

Open fealho opened this issue 4 years ago • 7 comments

Error Description

There is a bug in the way sdv.timeseries handles time series of length 1.

Steps to reproduce

The code below reproduces the error:

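import pandas as pd
from dateutil.parser import parse
from sdv.timeseries import PAR

# A single-row table, i.e. a time series of length 1.
# The values are wrapped in lists because pandas cannot build a
# DataFrame from scalar values alone without an explicit index.
data = pd.DataFrame({
    "date": [parse("2020-01-20")],
    "Close": [128.11],
    "exchange": ["NYSE"],
})

model = PAR(
    context_columns=["exchange"],
    sequence_index="date",
)

# Fitting on the length-1 sequence triggers the error.
model.fit(data)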

fealho avatar Mar 20 '21 16:03 fealho

This looks more like an invalid input problem (if there is only one data point, there is no sequence). We should confirm that this is still a problem and then add a validation with a user-friendly message that rejects the data as invalid.
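As a rough illustration of the validation proposed above, the sketch below counts the rows per sequence and raises a readable error when any sequence is too short. The function name `check_sequence_lengths`, its `min_length` parameter, and the wording of the message are hypothetical; this is not code from SDV or DeepEcho.

```python
from collections import Counter


def check_sequence_lengths(entity_ids, min_length=2):
    """Raise a readable error if any sequence has fewer than min_length rows.

    entity_ids is any iterable with one entity identifier per row,
    for example a list built from the entity column of the table.
    Hypothetical helper, not part of SDV or DeepEcho.
    """
    # Number of rows observed for each entity, i.e. each sequence's length.
    counts = Counter(entity_ids)
    too_short = sorted(str(key) for key, n in counts.items() if n < min_length)
    if too_short:
        preview = ", ".join(too_short[:5])
        raise ValueError(
            f"{len(too_short)} sequence(s) have fewer than {min_length} rows "
            f"(for example: {preview}). Remove these sequences or model "
            "them separately before fitting."
        )
```

Under these assumptions, calling `check_sequence_lengths(list(data["subject_id"]))` before `model.fit(data)` would replace the `IndexError` with a message that names the offending sequences.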

csala avatar Aug 26 '21 10:08 csala

I can confirm that this issue still persists.

SDV version: 0.15.0
DeepEcho version: 0.3.0.post1

There can be many valid reasons why there is only 1 item in a sequence, especially when you're recording event streams. One example: if you are recording transactions made by different users, it's not within your control how many transactions each user makes. You may even find that a majority of sequences have length 1.

It would be more user-friendly if we could accommodate such datasets without throwing an error, either in SDV or directly within the DeepEcho model.

Stack Trace

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-35-dcea4bada6e7> in <module>()
      2 
      3 model = PAR(entity_columns=['subject_id'], sequence_index='time')
----> 4 model.fit(data)
      5 model.sample(num_sequences=2)
/usr/local/lib/python3.7/dist-packages/sdv/timeseries/base.py in fit(self, timeseries_data)
    208 
    209         LOGGER.debug('Fitting %s model to table %s', self.__class__.__name__, self._metadata.name)
--> 210         self._fit(transformed)
    211 
    212     def get_metadata(self):

/usr/local/lib/python3.7/dist-packages/sdv/timeseries/deepecho.py in _fit(self, timeseries_data)
     85 
     86         # Validate and fit
---> 87         self._model.fit_sequences(sequences, context_types, data_types)
     88 
     89     def _sample(self, context=None, sequence_length=None):

/usr/local/lib/python3.7/dist-packages/deepecho/models/par.py in fit_sequences(self, sequences, context_types, data_types)
    316         self._build(sequences, context_types, data_types)
    317         for sequence in sequences:
--> 318             X.append(self._data_to_tensor(sequence['data']))
    319             C.append(self._context_to_tensor(sequence['context']))
    320 

/usr/local/lib/python3.7/dist-packages/deepecho/models/par.py in _data_to_tensor(self, data)
    211                 elif props['type'] in ['continuous', 'timestamp']:
    212                     mu_idx, sigma_idx, missing_idx = props['indices']
--> 213                     if pd.isnull(data[key][i]) or props['std'] == 0:
    214                         x[mu_idx] = 0.0
    215                     else:

IndexError: list index out of range

npatki avatar Jul 14 '22 17:07 npatki

Workaround

A manual workaround for now would be to split the table into two:

  1. A table containing only sequences of length 2 or more
  2. A table containing only sequences of length 1

You can model table (1) using the PAR model. If you require sequences of length 1, then you can model table (2) using any of the existing single table models in the SDV (GaussianCopula, CTGAN, etc.)
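The split described above could look roughly like the following pandas sketch. The helper name `split_by_sequence_length` and the toy column names are assumptions made for illustration, and the sketch assumes the entity column has no missing values.

```python
import pandas as pd


def split_by_sequence_length(data, entity_column, min_length=2):
    """Split a table into rows from long-enough sequences and the rest.

    Hypothetical helper, not part of SDV. Assumes entity_column
    contains no missing values.
    """
    # Length of the sequence each row belongs to, aligned with the rows.
    lengths = data[entity_column].map(data[entity_column].value_counts())
    long_enough = data[lengths >= min_length]
    too_short = data[lengths < min_length]
    return long_enough, too_short


# Toy data: subject "a" has 3 rows, subjects "b" and "c" have 1 row each.
data = pd.DataFrame({
    "subject_id": ["a", "a", "a", "b", "c"],
    "time": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-05"]
    ),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

table_1, table_2 = split_by_sequence_length(data, "subject_id")
```

Here `table_1` would go to the PAR model and `table_2` to a single table model, as described in the workaround.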

npatki avatar Aug 15 '22 14:08 npatki

This error still happens when the data contains sequences of length 1.

Ng-ms avatar Feb 05 '24 14:02 Ng-ms

Thank you for confirming @Ng-ms. We generally will keep a bug open until we work on it and give you a fix.

You can keep an eye on this issue to see when the status changes. Thanks.

npatki avatar Feb 05 '24 15:02 npatki

Just FYI: I just tried it out with my data set and it appears that this problem is already fixed and can be closed - at least when there are multiple sequences with length 1 in the data set.

MarcJohler avatar May 02 '24 08:05 MarcJohler

Thanks @MarcJohler -- I'm marking this one as under discussion while we check it out with some of the demo datasets. Thanks.

npatki avatar May 15 '24 15:05 npatki