SDV PAR SYNTHESIZER DATA GENERATION

Open SivaKrishna-1996 opened this issue 1 year ago • 5 comments

1.) My input for the PAR model is Nasdaq data with 50K records spanning several sequences. Why is the PAR model not generating 50K new records? Should I use the num_sequences and sequence_length parameters to get data of the length I require?

2.) Is the sequence key mandatory to run the PAR model? In other words, must the input time series data have a column (product, country, etc.) that distinguishes the sequences?

3.) Why is the PAR model so slow at both fitting and generating data for datasets of around 50K rows?

SivaKrishna-1996 avatar Oct 10 '24 13:10 SivaKrishna-1996

Hi @SivaKrishna-1996 👋 Do you mind sharing more about your use case or project goals? That will help us provide better guidance!

  1. Why is the PAR model not generating 50K new records? Should I use the num_sequences and sequence_length parameters to get data of the length I require?

Could you clarify what specifically isn't working for you? All SDV synthesizers aim to mimic the patterns in your real data when generating synthetic data.

Re: num_sequences and sequence_length: these parameters give you more control over the number of unique sequences in your synthetic data and the number of rows in each sequence. If you leave sequence_length blank, the number of rows in each sequence is determined algorithmically (mirroring the min / max lengths from the real data as closely as possible) and will usually vary between sequences.
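
To get a feel for what "min / max lengths from the real data" means for a given table, it can help to count rows per sequence key before sampling. The sketch below is only an illustration: it uses plain Python row dicts in place of a DataFrame, and the helper name `sequence_length_summary` and the toy rows are assumptions, not part of SDV.

```python
from collections import Counter


def sequence_length_summary(rows, sequence_key):
    """Return (number of sequences, shortest length, longest length)."""
    # Count how many rows belong to each value of the sequence key.
    lengths = Counter(row[sequence_key] for row in rows)
    if not lengths:
        return 0, 0, 0
    return len(lengths), min(lengths.values()), max(lengths.values())


# Toy rows standing in for the real table; 'Symbol' plays the sequence key.
rows = (
    [{"Symbol": "AAPL", "Date": f"2024-01-{d:02d}"} for d in range(1, 6)]
    + [{"Symbol": "MSFT", "Date": f"2024-01-{d:02d}"} for d in range(1, 4)]
)

summary = sequence_length_summary(rows, "Symbol")
print(summary)
```

If the shortest and longest sequences differ a lot, the total row count of a sample with sequence_length left blank can vary noticeably from run to run.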

2.) Is the sequence key mandatory to run the PAR model?

We highly recommend supplying a column for the sequence_key so SDV knows how to associate rows in a given sequence together when learning patterns. For a dataset on NASDAQ stocks, that might be the company / stock ticker.

If you don't supply a sequence_key, SDV will treat your real data as a single table as indicated in our documentation and we don't fully support this workflow right now.

3.) Why is the PAR model so slow at both fitting and generating data for datasets of around 50K rows?

PARSynthesizer is in general less mature than our other synthesizers and can take a while to fit. For now, to speed up training you can provide a smaller dataset (either fewer sequences or fewer rows per sequence). We've opened this tracking issue around PAR performance: https://github.com/sdv-dev/SDV/issues/1965
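
One way to shrink the training data along both axes is to keep a random subset of sequences and a capped prefix of each. The following is a rough sketch under assumptions: rows are plain Python dicts already ordered by the sequence index within each sequence, and `subsample_sequences` is a hypothetical helper, not an SDV function.

```python
import random
from collections import defaultdict


def subsample_sequences(rows, sequence_key, max_sequences,
                        max_rows_per_sequence, seed=0):
    """Keep at most max_sequences sequences and a prefix of each one."""
    # Group rows by sequence key, preserving their original order.
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[sequence_key]].append(row)

    # Pick a reproducible random subset of the sequence keys.
    keys = sorted(by_key)
    rng = random.Random(seed)
    kept = rng.sample(keys, min(max_sequences, len(keys)))

    # Take a contiguous prefix so each sequence stays in time order.
    return [row for key in kept for row in by_key[key][:max_rows_per_sequence]]


# Toy data: 4 symbols with 10 rows each.
rows = [{"Symbol": s, "Step": i} for s in "ABCD" for i in range(10)]
smaller = subsample_sequences(rows, "Symbol", max_sequences=2,
                              max_rows_per_sequence=3)
print(len(smaller))
```

Taking a contiguous prefix rather than random rows is a deliberate choice here: dropping rows from the middle of a sequence would change the time gaps the model sees.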

srinify avatar Oct 10 '24 14:10 srinify

Thanks for your response @srinify

This is regarding point 1:

I am training the PAR model on Nasdaq data with 50K records. I have set the 'Symbol' column as the sequence key and the 'Date' column as the sequence index.

After training, when I sample new synthetic data with num_sequences=19 (as I have 19 unique symbols), I expect to get 50K synthetic records because my input has 50K records. Instead, I am getting only 1500 records.

Is this the behaviour of the model itself, or should I change any parameters?

SivaKrishna-1996 avatar Oct 10 '24 15:10 SivaKrishna-1996

When you're reasoning about what PARSynthesizer will do, I recommend thinking primarily in terms of sequences, because PAR internally models closer to the sequence level than the 'total dataset' level.

Let's say your real data has 50k rows, 100 unique symbols, and 500 rows per symbol on average. If you only request 19 sequences (aka rows linked to 19 unique symbols), you should roughly expect 19 * 500 = 9500 rows if you don't set the sequence_length parameter explicitly.

So if total number of rows is important, you can either:

  • choose the same value for num_sequences as the number of unique symbols in your real data, leave sequence_length empty, and SDV should synthesize roughly 50k rows (but again, this might take a while)
  • choose some other value for num_sequences and then set sequence_length such that num_sequences * sequence_length = 50k rows
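
The arithmetic behind both options can be sketched in a few lines. The helper names below are made up for illustration; only num_sequences and sequence_length come from this thread.

```python
import math


def expected_rows(num_sequences, avg_rows_per_sequence):
    """Rough row count when sequence_length is left unset."""
    return num_sequences * avg_rows_per_sequence


def sequence_length_for(target_rows, num_sequences):
    """Smallest fixed sequence length that reaches at least target_rows."""
    return math.ceil(target_rows / num_sequences)


# The example above: 19 sequences at about 500 rows each.
rough_total = expected_rows(19, 500)
print(rough_total)  # 9500

# Fixed length needed to reach 50k rows with 19 sequences.
length = sequence_length_for(50_000, 19)
print(length, 19 * length)  # 2632 50008

# These values would then be passed to the fitted synthesizer, e.g.
# synthesizer.sample(num_sequences=19, sequence_length=length)
```

Because 50,000 is not divisible by 19, rounding up overshoots slightly; trim the extra rows afterwards if an exact total matters.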

What's your project or use case out of curiosity?

srinify avatar Oct 10 '24 16:10 srinify

If we use the paid version of SDV, do we get any time series algorithms other than PAR?

Does the paid version of SDV have any other advantages over the free version, such as computation?

SivaKrishna-1996 avatar Oct 14 '24 09:10 SivaKrishna-1996

Hi @SivaKrishna-1996 if you check out our SDV documentation, you can see which synthesizers require SDV Enterprise (there's a * next to the name): https://docs.sdv.dev/sdv/multi-table-data/modeling/synthesizers

In general, we keep this repo focused on the SDV Community edition, so if you have more specific questions about SDV Enterprise, you can contact our team.

srinify avatar Oct 15 '24 20:10 srinify

Hi there @SivaKrishna-1996 it's been a little while since we've heard from you so I'm going to move forward with closing this issue out! If you have more questions or issues with SDV Community, don't hesitate to open new issues!

srinify avatar Oct 27 '24 19:10 srinify