
PAR model sampling error when there is a numerical `sequence_index` (float, int)

Open doolingdavidrs21 opened this issue 2 years ago • 4 comments

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: '0.14.1'
  • Python version: Python 3.8.10
  • Operating System: Windows 10

Error Description

I am unable to use an integer field as the `sequence_index` parameter of the PAR model. I would like to do so because a model trained on an increasing integer index would produce values that can be mapped back to a datetime field whose frequency is something other than days.
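To illustrate the intended use, here is a minimal sketch of mapping an integer sequence index back to timestamps at a non-daily frequency. The helper name `index_to_datetime`, the start timestamp, and the hourly step are assumptions made for illustration; none of them come from SDV.

```python
from datetime import datetime, timedelta

# Hypothetical helper (not part of SDV): map an integer step to a timestamp.
def index_to_datetime(step, start=datetime(2022, 1, 3, 9, 0), freq=timedelta(hours=1)):
    """Return the timestamp that lies `step` intervals of `freq` after `start`."""
    return start + step * freq

# Integer index values as a model might produce them, mapped to hourly timestamps.
timestamps = [index_to_datetime(step) for step in (0, 1, 5)]
```

Any fixed interval works the same way by passing a different `freq`.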

Below is an example where setting the `sequence_index` parameter to an integer column allows the model to train, but the model's methods then fail and it cannot sample:

Steps to reproduce

from sdv.demo import load_timeseries_demo
import pandas as pd

data = load_timeseries_demo()

sequence_map = { sorted(data["Date"].unique())[i]: i for i in range(len(data["Date"].unique())) }

data["Date"] = data["Date"].map(sequence_map)

entity_columns = ["Symbol"]
context_columns = ["MarketCap", "Sector", "Industry"]
sequence_index = "Date"

from sdv.timeseries import PAR

model = PAR( entity_columns=entity_columns, context_columns=context_columns, sequence_index=sequence_index, verbose=True, epochs=45, )

model.fit(data)

# throws error

new_data = model.sample(num_sequences=1, sequence_length=10)


doolingdavidrs21 avatar May 20 '22 16:05 doolingdavidrs21

Here is the traceback when I run the above:

C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_continuous_distns.py:639: RuntimeWarning: invalid value encountered in sqrt
  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\optimize\minpack.py:175: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_continuous_distns.py:5320: RuntimeWarning: divide by zero encountered in true_divide
  return c**2 / (c**2 - n**2)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:2606: RuntimeWarning: invalid value encountered in double_scalars
  Lhat = muhat - Shat*mu
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\optimize\minpack.py:175: RuntimeWarning: The number of calls to function has reached maxfev = 600.
  warnings.warn(msg, RuntimeWarning)
PARModel(epochs=45, sample_size=1, cuda='cuda', verbose=True) instance created
Epoch 45 | Loss 1.814377784729004: 100%|███████████████████████████████████████████████| 45/45 [00:30<00:00,  1.46it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.63it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Date'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21128/1263544826.py in <module>
     35 # throws error
     36 
---> 37 new_data = model.sample(num_sequences=1, sequence_length=10)

~\Anaconda3\lib\site-packages\sdv\timeseries\base.py in sample(self, num_sequences, context, sequence_length)
    265 
    266         sampled = self._sample(context, sequence_length)
--> 267         return self._metadata.reverse_transform(sampled)
    268 
    269     def save(self, path):

~\Anaconda3\lib\site-packages\sdv\metadata\table.py in reverse_transform(self, data)
    712                 field_data = pd.Series(Table._get_fake_values(field_metadata, len(reversed_data)))
    713             else:
--> 714                 field_data = reversed_data[name]
    715 
    716             reversed_data[name] = field_data[field_data.notnull()].astype(self._dtypes[name])

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'Date'

doolingdavidrs21 avatar May 20 '22 16:05 doolingdavidrs21

Thanks for filing @doolingdavidrs21. I can replicate this issue.

For SDV developers: I did some digging and found the following --

  • If sequence index 'Date' is a datetime, then sampled data has column 'Date.value', which is reversed back to 'Date' (no issues)
  • If sequence index 'Date' is numerical, then sampled data has column 'Date.value' and the reversed name remains 'Date.value' (error)

npatki avatar May 20 '22 16:05 npatki

Potential Workarounds

  1. If the sequence index is only used for ordering and the data is already in order, you can drop the sequence index
data = data.drop([sequence_index], axis=1)
  2. Alternatively, you can cast an int column into datetime, as proposed by @yamidibarra in #943
import pandas as pd

sequence_index = 'my_sequence_index_column_name' # name of column
data[sequence_index] = pd.to_datetime(data[sequence_index]) 

Remember to cast the synthetic data back to an int at the end

synthetic_data[sequence_index] = synthetic_data[sequence_index].astype(int)
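As a rough illustration of the second workaround, the sketch below round-trips a small integer column through datetime using pandas alone. By default `pd.to_datetime` interprets integers as nanoseconds since the Unix epoch, so the intermediate timestamps all fall just after 1970-01-01. The sample values are made up, and `astype("int64")` is used here in place of `astype(int)` on the assumption that an explicit 64-bit type is safer across platforms.

```python
import pandas as pd

# Made-up integer sequence index, standing in for data[sequence_index].
original = pd.Series([0, 1, 2, 3], name="Date")

# Integers are read as nanoseconds since the Unix epoch (the default unit).
as_datetime = pd.to_datetime(original, unit="ns")

# Casting back recovers the nanosecond counts, i.e. the original integers.
restored = as_datetime.astype("int64")
```

Synthetic values sampled from a model are not guaranteed to land on the original integer grid, so the result may need checking after the cast.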

npatki avatar Aug 11 '22 20:08 npatki

Thanks for filing @doolingdavidrs21. I can replicate this issue.

For SDV developers: I did some digging and found the following --

  • If sequence index 'Date' is a datetime, then sampled data has column 'Date.value', which is reversed back to 'Date' (no issues)
  • If sequence index 'Date' is numerical, then sampled data has column 'Date.value' and the reversed name remains 'Date.value' (error)

This is because in the PAR model, only the datetime columns are transformed. This can be seen here: https://github.com/sdv-dev/SDV/blob/f822903f60f0a983d40fb34b4724912c6eb578d8/sdv/timeseries/base.py#L74-L80

However, in sampling we add the .value suffix back in for the sequence index no matter what type it is: https://github.com/sdv-dev/SDV/blob/f822903f60f0a983d40fb34b4724912c6eb578d8/sdv/timeseries/deepecho.py#L141-L143. This is a bug.
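The mismatch described above can be modelled in a few lines. This is a simplified sketch, not SDV's code: plain dictionaries stand in for DataFrames, and the function names `fit_renames`, `sample_columns`, and `reverse_transform` are hypothetical.

```python
def fit_renames(sequence_index, is_datetime):
    """Model of fitting: only a datetime column gets a 'name.value' -> 'name' reverse mapping."""
    return {f"{sequence_index}.value": sequence_index} if is_datetime else {}

def sample_columns(sequence_index):
    """Model of sampling: the '.value' suffix is added whatever the column type was."""
    return {f"{sequence_index}.value": [0.0, 1.0, 2.0]}

def reverse_transform(sampled, renames, sequence_index):
    """Apply the known renames, then look the sequence index up by its original name."""
    reversed_data = {renames.get(name, name): values for name, values in sampled.items()}
    return reversed_data[sequence_index]  # KeyError if the column was never renamed back

# Datetime index: 'Date.value' is renamed back to 'Date', so the lookup succeeds.
ok = reverse_transform(sample_columns("Date"), fit_renames("Date", True), "Date")

# Numerical index: no rename was registered, so looking up 'Date' fails.
try:
    reverse_transform(sample_columns("Date"), fit_renames("Date", False), "Date")
    failed = False
except KeyError:
    failed = True
```

In this model the numerical case fails with the same `KeyError: 'Date'` that appears in the traceback.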

amontanez24 avatar Oct 04 '22 23:10 amontanez24

Great news! This issue has now been resolved in our new SDV 1.0 (Beta!) release. Check it out and let us know if you're still encountering any problems.


npatki avatar Mar 09 '23 23:03 npatki