SDV
PAR model sampling error when there is a numerical `sequence_index` (float, int)
Environment Details
Please indicate the following details about the environment in which you found the bug:
- SDV version: '0.14.1'
- Python version: Python 3.8.10
- Operating System: Windows 10
Error Description
I am unable to use an integer-typed field as the `sequence_index` parameter of a PAR model. I would like to do so because a model trained on an increasing integer index produces values that can be mapped back to a datetime field whose frequency is something other than days.
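A minimal sketch of the mapping described above, assuming a fixed hourly grid. The start time, the step size, and both helper names are hypothetical and do not come from SDV or the demo dataset.

```python
from datetime import datetime, timedelta

# Hypothetical grid: start time and step size are made up for illustration.
START = datetime(2021, 1, 1, 0, 0)
STEP = timedelta(hours=1)


def index_to_timestamp(i: int) -> datetime:
    """Map an integer sequence index to its timestamp on the hourly grid."""
    return START + i * STEP


def timestamp_to_index(ts: datetime) -> int:
    """Inverse mapping: a timestamp on the grid back to its integer position."""
    return (ts - START) // STEP


# Integer indices (as a sampled sequence might contain) mapped back to timestamps.
timestamps = [index_to_timestamp(i) for i in range(3)]
```

Any regular frequency works the same way by changing `STEP`; an irregular calendar would need an explicit lookup table like the `sequence_map` in the reproduction steps.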
Below is an example where setting the `sequence_index` parameter to an integer column allows the model to train, but sampling from the fitted model then fails:
Steps to reproduce
from sdv.demo import load_timeseries_demo
import pandas as pd
data = load_timeseries_demo()
sequence_map = { sorted(data["Date"].unique())[i]: i for i in range(len(data["Date"].unique())) }
data["Date"] = data["Date"].map(sequence_map)
entity_columns = ["Symbol"]
context_columns = ["MarketCap", "Sector", "Industry"]
sequence_index = "Date"
from sdv.timeseries import PAR
model = PAR(
    entity_columns=entity_columns,
    context_columns=context_columns,
    sequence_index=sequence_index,
    verbose=True,
    epochs=45,
)
model.fit(data)
# throws error
new_data = model.sample(num_sequences=1, sequence_length=10)
Here is the traceback when I run the above:
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_continuous_distns.py:639: RuntimeWarning: invalid value encountered in sqrt
sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\optimize\minpack.py:175: RuntimeWarning: The iteration is not making good progress, as measured by the
improvement from the last ten iterations.
warnings.warn(msg, RuntimeWarning)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_continuous_distns.py:5320: RuntimeWarning: divide by zero encountered in true_divide
return c**2 / (c**2 - n**2)
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:2606: RuntimeWarning: invalid value encountered in double_scalars
Lhat = muhat - Shat*mu
C:\Users\davidd\Anaconda3\lib\site-packages\scipy\optimize\minpack.py:175: RuntimeWarning: The number of calls to function has reached maxfev = 600.
warnings.warn(msg, RuntimeWarning)
PARModel(epochs=45, sample_size=1, cuda='cuda', verbose=True) instance created
Epoch 45 | Loss 1.814377784729004: 100%|███████████████████████████████████████████████| 45/45 [00:30<00:00, 1.46it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.63it/s]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
~\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'Date'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21128/1263544826.py in <module>
35 # throws error
36
---> 37 new_data = model.sample(num_sequences=1, sequence_length=10)
~\Anaconda3\lib\site-packages\sdv\timeseries\base.py in sample(self, num_sequences, context, sequence_length)
265
266 sampled = self._sample(context, sequence_length)
--> 267 return self._metadata.reverse_transform(sampled)
268
269 def save(self, path):
~\Anaconda3\lib\site-packages\sdv\metadata\table.py in reverse_transform(self, data)
712 field_data = pd.Series(Table._get_fake_values(field_metadata, len(reversed_data)))
713 else:
--> 714 field_data = reversed_data[name]
715
716 reversed_data[name] = field_data[field_data.notnull()].astype(self._dtypes[name])
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'Date'
Thanks for filing @doolingdavidrs21. I can replicate this issue.
For SDV developers: I did some digging and found the following --
- If the sequence index `'Date'` is a datetime, then the sampled data has a column `'Date.value'`, which is reversed back to `'Date'` (no issues)
- If the sequence index `'Date'` is numerical, then the sampled data has a column `'Date.value'` and the reversed name remains `'Date.value'` (error)
Potential Workarounds
- If the sequence index is only used for ordering and the data is already in order, you can drop the sequence index:
data = data.drop([sequence_index], axis=1)
- Alternatively, you can cast an int column into `datetime`, as proposed by @yamidibarra in #943:
import pandas as pd
sequence_index = 'my_sequence_index_column_name' # name of column
data[sequence_index] = pd.to_datetime(data[sequence_index])
Remember to cast the synthetic data back to an int at the end:
synthetic_data[sequence_index] = synthetic_data[sequence_index].astype(int)
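The cast-and-restore workaround can be checked on a toy frame without training a model. This is only a sketch: the frame contents are made up, and `synthetic_data` here is a stand-in built from the input rather than the output of `model.sample()`. It also spells the target dtype as `"int64"`, which is an assumption made to avoid depending on the platform's default integer width.

```python
import pandas as pd

sequence_index = "Date"  # column name used in the report above

# Toy frame standing in for the real training data (values are made up).
data = pd.DataFrame({"Symbol": ["A", "A", "A"], sequence_index: [0, 1, 2]})

# Forward cast: by default pd.to_datetime reads integers as nanoseconds
# since 1970-01-01.
as_datetime = pd.to_datetime(data[sequence_index])

# Stand-in for the synthetic output; a real run would call model.sample().
synthetic_data = pd.DataFrame({sequence_index: as_datetime})

# Backward cast: the datetime column becomes integer nanoseconds again.
synthetic_data[sequence_index] = synthetic_data[sequence_index].astype("int64")
```

On this toy data the round trip returns the original integers. Values sampled from a trained model are generated timestamps, so they will not necessarily land on the original integer grid and may need rounding.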
The root cause is that in the PAR model only the datetime columns are transformed. This can be seen here:
https://github.com/sdv-dev/SDV/blob/f822903f60f0a983d40fb34b4724912c6eb578d8/sdv/timeseries/base.py#L74-L80
However, during sampling the `.value` suffix is added back to the sequence index no matter what type it is:
https://github.com/sdv-dev/SDV/blob/f822903f60f0a983d40fb34b4724912c6eb578d8/sdv/timeseries/deepecho.py#L141-L143
For a numerical sequence index, the sampled column therefore stays named `'Date.value'`, and the reverse transform fails with `KeyError: 'Date'`. This is a bug.
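The mismatch can be illustrated with a small simulation. This is a sketch of the behavior described in this thread, not SDV's actual code: `fit_column_names`, `sample_column_name`, and `reversed_name` are hypothetical helpers written only to mimic the two steps.

```python
import pandas as pd


def fit_column_names(data: pd.DataFrame) -> dict:
    """Hypothetical fit step: only datetime columns get a '.value' entry."""
    mapping = {}
    for name in data.columns:
        if pd.api.types.is_datetime64_any_dtype(data[name]):
            mapping[f"{name}.value"] = name
    return mapping


def sample_column_name(sequence_index: str) -> str:
    """Hypothetical sampling step: the suffix is added regardless of type."""
    return f"{sequence_index}.value"


def reversed_name(data: pd.DataFrame, sequence_index: str) -> str:
    """Name of the sequence index column after the simulated reverse step."""
    mapping = fit_column_names(data)
    sampled = sample_column_name(sequence_index)
    # Without a mapping entry, the suffixed name is left unchanged.
    return mapping.get(sampled, sampled)


datetime_data = pd.DataFrame({"Date": pd.to_datetime(["2021-01-01", "2021-01-02"])})
numeric_data = pd.DataFrame({"Date": [0, 1]})

datetime_result = reversed_name(datetime_data, "Date")  # 'Date'
numeric_result = reversed_name(numeric_data, "Date")    # 'Date.value'
```

In the numeric case the column keeps its suffixed name, so a later lookup of `'Date'` would fail, which matches the `KeyError` in the traceback.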
Great news! This issue has now been resolved in our new SDV 1.0 (Beta!) release. Check it out and let us know if you're still encountering any problems.
Resources:
- New documentation for the PARSynthesizer
- Tutorial for PAR