ydata-synthetic

[Question] TimeGAN postprocessing of generated data

Open pbezz1 opened this issue 3 years ago • 3 comments

A question about TimeGAN postprocessing. I'm using TimeGAN to experiment with generating stock returns for a handful of stocks simultaneously, with a sequence length of 24, following the seminal paper by Yoon et al. (2019).

After successfully running TimeGAN on the input data, I end up with a multi-dimensional array of shape (5000, 24, 10), where 5000 is the number of generated sequences, 24 is the sequence length and 10 is the number of stocks.

Now I want to take the generated sequences and produce a matrix (x, 10), where x is the resulting matrix length, so that I can use it in my subsequent experiments. How do I convert (5000, 24, 10) to (x, 10)? Do I just reshape the array, or is there a better way?

pbezz1, Jan 04 '22 20:01

Hello @pbezz1! You are interested in retrieving a single record for each of the learned time series but with an arbitrary length x, is this correct?

This is the implemented sample method that we have in TimeGAN:

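    import numpy as np
    from tqdm import trange

    def sample(self, n_samples):
        # Run enough whole batches to cover n_samples (may return more than requested)
        steps = n_samples // self.batch_size + 1
        data = []
        for _ in trange(steps, desc='Synthetic data generation'):
            Z_ = next(self.get_batch_noise())  # one batch of noise
            records = self.generator(Z_)  # one batch of generated sequences
            data.append(records)
        # Stack the batches along the first axis
        return np.array(np.vstack(data))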

The difficulty you are facing comes from the architecture of TimeGAN, which we implemented according to the original article. The generator is a recurrent model, meaning it keeps a state over time that inherently allows auto-regressive behavior when manipulated to that effect. We use the RNN in a stateless manner, which means it does not preserve auto-correlation between different predictions; the state plays a different role in model training. There are other mechanisms to achieve auto-correlation, but I will not dig into them here. If you want to learn more about RNNs and auto-correlation, I suggest you look up stateful vs. stateless RNN operation, or this TF page about statefulness in RNNs.

Solutions to your question come in two forms.

The first is to use TimeGAN, but instantiated with x as your predetermined sequence length. Whenever you use the sample method you would then obtain an output of shape (batch_size, x, 10). Any record from the resulting sample should match your request (depending on your intended sequence length, this can very well become overwhelming). To get your desired output from such a shape you would need a slice operation, not a reshape, or nothing at all if your batch size is set to 1.

The alternative is a different kind of synthesizer where the generator is auto-regressive, i.e. can be conditioned on its previous predictions. There is an open pull request that aims to introduce just that. I don't want to hint at timelines, but I would say it is in a very advanced state.
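To make the stateless point concrete, here is a minimal NumPy sketch. It does not use TimeGAN: a simple AR(1) process (with an assumed coefficient `phi = 0.9`) stands in for a generator that starts every sequence from a fresh state. All names and parameters are illustrative assumptions. Within a sequence, neighboring steps are strongly correlated, but the last step of one sequence tells you essentially nothing about the first step of the next.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, seq_len, phi = 2000, 24, 0.9  # assumed, illustrative values

# Stand-in for a stateless generator: each sequence starts from a fresh state.
noise = rng.standard_normal((n_seq, seq_len))
seqs = np.zeros((n_seq, seq_len))
seqs[:, 0] = noise[:, 0]
for t in range(1, seq_len):
    seqs[:, t] = phi * seqs[:, t - 1] + noise[:, t]

# Correlation between the last two steps inside each sequence.
within = np.corrcoef(seqs[:, -2], seqs[:, -1])[0, 1]

# Correlation between the end of one sequence and the start of the next one.
across = np.corrcoef(seqs[:-1, -1], seqs[1:, 0])[0, 1]

print(f"within-sequence correlation: {within:.2f}")
print(f"across-sequence correlation: {across:.2f}")
```

This is why stacking independently generated sequences end to end would not give one continuous series: the temporal structure stops at each sequence boundary.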

Let me know if this closes the issue. Also feel free to join us in the Slack community. Cheers!

jfsantos-ds, Jan 04 '22 22:01

@jfsantos-ds thanks for your reply. I'm using the sample method to retrieve samples after training the GAN. I don't understand this bit: "To achieve your desired output from such a shape you would need a slice operation not reshape".

Let me give you a concrete example: My initial timeseries matrix is of shape (10000, 20), after preprocessing becomes (10000, 24, 20). I train a TimeGAN and generate another 2000 samples (with the sample method) which generates a batch of shape (2000, 24, 20). I need to de-sequence the (2000, 24, 20) matrix into (some number, 20) so that my synthetic data has the same shape as my initial timeseries matrix.

Could you guide me on the slicing approach you mentioned above please?

pbezz1, Jan 04 '22 23:01

The sample shape that you are trying to retrieve, (x, 20), is basically a single record of length x covering the 20 time series. The slice operation is required because the sample will probably contain more records than requested, depending on the batch size you define. This is expected: the sample method above retrieves a batch of noise samples, and passing a batch of inputs gets you a batch of outputs (more than you asked for). To reach n_samples, it returns enough whole batches to cover the requested n_samples. Slicing means cutting the data object and is as easy as this:

    # sample() always returns whole batches, so even a request for one record yields batch_size records
    sample = synth.sample(1)
    record = sample[0]  # slice out the first record, shape (seq_len, 20)
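The batch arithmetic can be checked without a trained model. The sketch below uses a hypothetical stand-in, `fake_sample`, that only mimics the batching logic of the sample method quoted earlier in the thread; the batch size of 128 and the zero-filled arrays are assumptions for illustration, not library defaults.

```python
import numpy as np

def fake_sample(n_samples, batch_size=128, seq_len=24, n_features=20):
    """Hypothetical stand-in: same batching as the quoted sample method, dummy data."""
    steps = n_samples // batch_size + 1
    batches = [np.zeros((batch_size, seq_len, n_features)) for _ in range(steps)]
    return np.vstack(batches)

sample = fake_sample(2000)
print(sample.shape)   # (2048, 24, 20): 16 whole batches of 128

trimmed = sample[:2000]   # keep exactly the number of records requested
record = trimmed[0]       # one record, shape (24, 20)
print(trimmed.shape, record.shape)
```

Slicing along the first axis with `[:n_samples]` is one way to trim the surplus records when an exact count matters.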

What I wrote above about achieving a fully auto-correlated record of length x still applies. If your sequence length is smaller than x, you cannot get a fully auto-correlated record with these dimensions from the TimeGAN implementation (you might be able to if you play around with state manipulation a bit). If x is actually smaller than your sequence length, you need to slice the sample along two axes, like so:

    record_ = sample[0, :x]  # first record, truncated to length x along the time axis

The difference between slicing and reshaping is that reshaping preserves all the information and only rearranges it (merging or splitting axes, for example). Slicing typically keeps only a subset of the information.
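A small NumPy sketch of that difference, using the shapes from the question. The array here is placeholder data built with `np.arange`, not real generator output, and `x = 10` is an arbitrary illustrative value.

```python
import numpy as np

# Placeholder for generated data of shape (n_records, seq_len, n_series).
sample = np.arange(2000 * 24 * 20).reshape(2000, 24, 20)

# Reshape: every value is kept, the records are simply stacked end to end.
flat = sample.reshape(-1, 20)
print(flat.shape)     # (48000, 20)

# Slice: keep one record, or only part of one record.
record = sample[0]
x = 10
short = sample[0, :x]
print(record.shape, short.shape)   # (24, 20) (10, 20)

# Rows 24..47 of the reshaped matrix are exactly the second record.
print(np.array_equal(flat[24:48], sample[1]))
```

The reshaped matrix does have the (some number, 20) layout, but every block of 24 rows is a separately generated record, so it should not be read as one continuous series of 48000 steps.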

jfsantos-ds, Jan 05 '22 10:01