Axial-LOB-High-Frequency-Trading-with-Axial-Attention

Can't replicate Paper Performance

Open LeonardoBerti00 opened this issue 1 year ago • 15 comments

Unfortunately I can't replicate the performance reported in the original paper, probably because of the hyperparameters; I don't have enough computational power to run a hyperparameter search. The max F1-score I've reached is 79% for k=10 and 78% for k=5.

LeonardoBerti00 avatar Apr 02 '23 16:04 LeonardoBerti00

Hahaha that's true. BTW, what is the variable T used for in your code? I only found it used to determine the length of the dataset and to print hyperparameter information; it is not involved in organizing the data or in the training process at all. :)

killa1218 avatar Apr 07 '23 06:04 killa1218

The FI-2010 dataset is constructed by taking a snapshot of the LOB every 10 events, so if the real horizon is 50, we need a variable equal to horizon/10 to compute the length of the datasets (train, val and test); this variable is T. h, on the other hand, is used only to select the right label column.

LeonardoBerti00 avatar Apr 16 '23 17:04 LeonardoBerti00

The paper seems a bit ambiguous about the network architecture: Fig. 1 shows only one axial attention block. Is your interpretation that the whole block (in the grey box) is repeated again afterwards?

Did you get the best performance with the hyperparameters in the notebook? With these the model has 20,219 trainable params compared to the 9,615 quoted in the paper, suggesting they used smaller channel dims, or maybe only one block as in Fig. 1.

OliverT1 avatar Apr 26 '23 16:04 OliverT1

Yes I think so, because they write "The main building component of the proposed model, shown in Fig. 1, is the gated axial attention block, which consists of two layers, each containing two multi-head axial attention modules with gated positional encodings", but I'm not sure. As far as performance is concerned, I haven't been able to do a complete hyperparameter search.

LeonardoBerti00 avatar May 02 '23 08:05 LeonardoBerti00

Why `self.length = x.shape[0] - T - self.dim + 1` in the `Dataset` class? Thank you

chine007 avatar May 31 '23 13:05 chine007

T is horizon/10; I explained the meaning of the variable in more detail in the previous comment. `self.dim` is the number of LOB snapshots in the input for every element of the dataset. To compute the total length we subtract T because we don't have labels for the last T elements, and we subtract `self.dim` because we can't make a prediction for the first `self.dim` (40) elements. The +1 is for indexing reasons. Let me know if it's clear now.
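To make that concrete, here is a minimal sketch of a sliding-window dataset using that length formula. The class name, the label layout (one precomputed label per snapshot row), and the indexing are assumptions reconstructed from this discussion, not the exact repository code:

```python
import numpy as np

class LOBDataset:
    """Sliding windows over FI-2010 snapshots (a sketch, not the repo code).

    x:   (num_snapshots, num_features) array of LOB snapshots
    y:   (num_snapshots,) array of precomputed labels for the chosen horizon
    T:   horizon / 10 -- the last T rows have no valid label
    dim: snapshots per input window (40 in this discussion)
    """
    def __init__(self, x, y, T, dim):
        self.x, self.y, self.T, self.dim = x, y, T, dim
        # number of dim-row windows whose label is still inside the data
        self.length = x.shape[0] - T - dim + 1

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        window = self.x[i : i + self.dim]  # dim consecutive snapshots
        label = self.y[i + self.dim - 1]   # label of the window's last row
        return window, label
```

With 1000 snapshots, T=5 and dim=40, this gives 1000 - 5 - 40 + 1 = 956 samples.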

LeonardoBerti00 avatar May 31 '23 18:05 LeonardoBerti00


very clear. Thank you

chine007 avatar Jun 01 '23 08:06 chine007


Not sure my understanding is correct. Does T mean the future window size to predict (the prediction horizon in the paper)? If T=5, does it mean using the current `self.dim` (40) snapshots of the current window to predict the data 5 windows ahead?

henghamao avatar Jun 10 '23 08:06 henghamao

yes

LeonardoBerti00 avatar Jun 10 '23 10:06 LeonardoBerti00

Thanks for the reply! In addition, where can we find a description of the FI-2010 data? The download link https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649/data doesn't include a description. We are also wondering how to select the right label from the data for different time windows (e.g. the code uses column -2 to select the y label).

henghamao avatar Jun 10 '23 10:06 henghamao

You can find it in the paper where the dataset was proposed, "Benchmark Dataset for Mid-Price Forecasting of LOB Data with ML". Anyway, column -1 is horizon 100, -2 is 50, -3 is 30, -4 is 20 and -5 is 10.
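A small helper for that mapping might look like this (a sketch; it assumes the five label columns sit at the end of each row, as described above, and that FI-2010's 1/2/3 class encoding is shifted to 0-based):

```python
import numpy as np

# rightmost five columns hold the labels, one per prediction horizon
HORIZON_TO_COLUMN = {10: -5, 20: -4, 30: -3, 50: -2, 100: -1}

def select_labels(data, horizon):
    """Return the label column for the given prediction horizon.

    FI-2010 encodes labels as 1/2/3 (up/stationary/down); subtracting 1
    makes them 0-based for losses like CrossEntropyLoss.
    """
    return data[:, HORIZON_TO_COLUMN[horizon]] - 1
```

So `select_labels(data, 50)` picks column -2, matching the indexing discussed in this thread.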

LeonardoBerti00 avatar Jun 10 '23 21:06 LeonardoBerti00

I ran into the same problem and could not replicate the performance from the paper. Running the code, we got the best epoch at epoch 23 (Train Loss: 0.7474, Validation Loss: 0.8472, Duration: 3:11:56.469105, Best Val Epoch: 23) and the following test-set evaluation:

    Test acc: 0.7913

                  precision    recall  f1-score   support

               0     0.7600    0.7429    0.7514     38447
               1     0.8379    0.8466    0.8422     65996
               2     0.7366    0.7404    0.7385     35100

        accuracy                         0.7913    139543
       macro avg     0.7782    0.7766    0.7774    139543
    weighted avg     0.7910    0.7913    0.7911    139543

What could be the reason for the performance difference from the paper? In the paper, the F1 score for k=50 is 83.27. What would be the next tuning steps, e.g. optimizer or hyperparameter tuning?
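As a sanity check, the macro F1 in the report is just the unweighted mean of the per-class F1 scores; recomputing it in pure Python from the per-class precision/recall values quoted above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# per-class (precision, recall) pairs from the report above
per_class = [(0.7600, 0.7429), (0.8379, 0.8466), (0.7366, 0.7404)]
macro_f1 = sum(f1(p, r) for p, r in per_class) / len(per_class)
print(round(macro_f1, 4))  # 0.7774, vs. 83.27 reported in the paper for k=50
```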

henghamao avatar Jun 14 '23 13:06 henghamao

I think these hyperparameters have to be tuned to reach the same performance as the paper:

    c_final = 4       # channel output size of the second conv
    n_heads = 4
    c_in_axial = 32   # channel output size of the first conv
    c_out_axial = 32
    pool_kernel = (1, 4)
    pool_stride = (1, 4)

Unfortunately, the model is slow to train.
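A cheap way to explore those settings is a random search over a small grid. The sketch below is hypothetical: the search space values and the `train_and_eval` callback (which would wrap a full AxialLOB training run) are assumptions, not part of the repository:

```python
import itertools
import random

# hypothetical search space around the values quoted above
SEARCH_SPACE = {
    "c_in_axial":  [16, 32, 64],
    "c_out_axial": [16, 32, 64],
    "c_final":     [2, 4, 8],
    "n_heads":     [2, 4, 8],
}

def random_search(train_and_eval, n_trials=5, seed=0):
    """Evaluate n_trials random configs; return (best_score, best_config).

    train_and_eval(config) -> validation F1. It is supplied by the caller,
    since a single full AxialLOB training run takes hours.
    """
    configs = [dict(zip(SEARCH_SPACE, values))
               for values in itertools.product(*SEARCH_SPACE.values())]
    random.Random(seed).shuffle(configs)
    scored = [(train_and_eval(cfg), cfg) for cfg in configs[:n_trials]]
    return max(scored, key=lambda sc: sc[0])
```

Given how slow each run is, a handful of trials on a reduced training set may be the only practical way to narrow the grid before full training.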

LeonardoBerti00 avatar Jun 14 '23 17:06 LeonardoBerti00

You might try torch.compile() in PyTorch 2.0:

model = AxialLOB(W, dim, c_in_axial, c_out_axial, c_final, n_heads, pool_kernel, pool_stride)
model = torch.compile(model)
model.to(device)

With this change, one training step's time dropped from 12 min to 9 min on Google Colab.

henghamao avatar Jun 15 '23 10:06 henghamao

thank you

LeonardoBerti00 avatar Jun 15 '23 10:06 LeonardoBerti00