transformer Hello, thanks for your great work, I'm confused about the dataset.

Hello sir, I'm confused about the dataset. Could you share dataset_57M.npz or another demo dataset? I just don't know the dataset's structure.

Apr 16 '22 05:04 StarDxxx

Hello, for the dataset used in these examples, please see #2. The expected structure of the input data is described in the Transformer's documentation; you can implement your own dataset as long as it matches this input shape.

Apr 25 '22 08:04 maxjcohen

Hi, I have read the docs. I understand the model's inputs and outputs as follows: d_input and d_output are the numbers of input and output features. For example, if we use PM2.0 and PM5 to predict a pollution level, d_input and d_output are 2 and 1, respectively. However, I don't understand the parameter K in the input and output tensor shapes, e.g. (batch_size, K, d_output).

Apr 25 '22 09:04 chuzheng88

In other words, I want to solve a regression task, described as follows: X has two features, X = [[x01, x02, ..., x0j], [x11, x12, ..., x1j]], and Y (the labels) has one feature, Y = [y1, y2, ..., yj]. Put simply, we use two sequences to predict one sequence, like using the sin and cos functions to predict the tan function. In this case, how should we construct the dataset?

Apr 25 '22 09:04 chuzheng88

K is the length of the time series. In your example K = j, so each batch of data should consist of inputs with shape (batch_size, j, 2) and outputs with shape (batch_size, j, 1).
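As a rough illustration of those shapes, here is a minimal NumPy sketch. The function name `make_windows`, the `step` parameter and the random phases are made up for this example. The target is sin(t)·cos(t) rather than tan(t), a deliberate substitution because tan is unbounded near its poles and would make a poor toy target.

```python
import numpy as np


def make_windows(n_samples, K, step=0.1, seed=0):
    """Build toy inputs of shape (n_samples, K, 2) and targets of shape (n_samples, K, 1).

    Hypothetical helper for illustration only; it is not part of the library.
    """
    rng = np.random.default_rng(seed)
    # One random starting phase per sample, shape (n_samples, 1).
    starts = rng.uniform(0.0, 2.0 * np.pi, size=(n_samples, 1))
    # Time grid for every sample, shape (n_samples, K).
    t = starts + step * np.arange(K)
    # Two input features per time step: sin(t) and cos(t).
    X = np.stack([np.sin(t), np.cos(t)], axis=-1)
    # One output feature per time step (stand-in for tan, see the note above).
    Y = (np.sin(t) * np.cos(t))[..., np.newaxis]
    return X.astype(np.float32), Y.astype(np.float32)


X, Y = make_windows(n_samples=8, K=12)
print(X.shape, Y.shape)
```

Here d_input = 2, d_output = 1 and K = 12; slicing `X[:batch_size]` and `Y[:batch_size]` gives one batch with the shapes described above.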

Apr 25 '22 10:04 maxjcohen

Thanks for your reply. In this case, can the parameter attention_size be set <= K?

Apr 25 '22 11:04 chuzheng88

Yes, exactly!
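One way to picture this, assuming attention_size acts as the half-width of a local attention window (an assumption for this sketch, not something confirmed in this thread), is a band-shaped mask over the K time steps. The helper `local_attention_mask` below is hypothetical and not the library's own function.

```python
import numpy as np


def local_attention_mask(K, attention_size):
    """Return a (K, K) boolean mask; True marks pairs of time steps that are masked out.

    Assumption: step i may attend to step j only when |i - j| <= attention_size.
    """
    i, j = np.indices((K, K))
    return np.abs(i - j) > attention_size


mask = local_attention_mask(K=12, attention_size=3)
print(mask.sum(), "of", mask.size, "pairs are masked")
```

Under this assumption, any attention_size of K - 1 or more masks nothing, so values above the series length would have no further effect.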

Apr 25 '22 12:04 maxjcohen

Hi, I used a dataset X, produced by the sin function, to predict Y (produced by the cos function), with K set to 12. During validation, loss=nan. I don't know why. Note that the whole code is as follows:

Apr 26 '22 06:04 chuzheng88

Hi, I don't see directly where a NaN could come from, I encourage you to debug during the validation loss computation in order to see what tensor or function is malfunctioning.
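A simple first step when debugging, sketched here with NumPy, is to count NaN and infinite values in each array involved (inputs, targets, predictions) to narrow down where they first appear. The helper name `describe_non_finite` is made up for this example.

```python
import numpy as np


def describe_non_finite(name, array):
    """Count NaN and infinite entries in an array-like (hypothetical debugging helper)."""
    array = np.asarray(array, dtype=np.float64)
    n_nan = int(np.isnan(array).sum())
    n_inf = int(np.isinf(array).sum())
    return {
        "name": name,
        "shape": array.shape,
        "nan": n_nan,
        "inf": n_inf,
        "ok": n_nan == 0 and n_inf == 0,
    }


# Toy example: one NaN and one infinity hidden in the data.
print(describe_non_finite("x", [1.0, float("nan"), float("inf")]))
```

For PyTorch tensors, passing `tensor.detach().cpu().numpy()` should work with the same helper. If the inputs and targets are clean but the predictions are not, the problem is more likely in the model or the optimisation (for instance the learning rate) than in the dataset.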

Apr 26 '22 08:04 maxjcohen

In fact, the loss is already nan during network training, e.g.,

In my opinion, with loss_function = OZELoss(alpha=0.3), the training loss shouldn't be nan, so I don't understand why it is.

Furthermore, I used the compute_loss function to calculate the loss during validation, as follows:

Apr 26 '22 08:04 chuzheng88

Is my dataset wrong?

Apr 26 '22 08:04 chuzheng88