man-sf-emnlp

npy and pickle files

Open dotchen opened this issue 2 years ago • 16 comments

Hi,

Thank you for releasing the code.

Can you also release the .npy and pickle files loaded in the training script?

dotchen avatar Dec 06 '21 03:12 dotchen

Also, is it possible to share the exact script to generate these files? Many thanks

dotchen avatar Dec 06 '21 03:12 dotchen

Hi @dotchen ,

I am also trying to replicate the work. I read the code and used a small random example to simulate the data, and the code seems to work. You may need to do the data generation on your own.

I removed all the data loading sections and added:

import numpy as np
import torch

num_sample = 100
n_stock = 100          # number of stocks
n_day = 5              # backward-looking window T
n_tweet_per_day = 1    # max tweets per stock per day; I assume 1 tweet per stock per day
n_price_feat = 3       # price feature dimension
n_tweet_feat = 512     # text embedding dimension

adj = np.eye(n_stock)  # adjacency matrix with only self-connections
adj = torch.tensor(adj, dtype=torch.int8)

In train(epoch), I use random data for training:

test_text = torch.tensor(np.random.normal(size=(n_stock, n_day, n_tweet_per_day, n_tweet_feat)), dtype=torch.float32).cuda()
test_price = torch.tensor(np.random.normal(size=(n_stock, n_day, n_price_feat)), dtype=torch.float32).cuda()
test_label = torch.tensor(np.random.choice([0, 1], size=(n_stock, 1)), dtype=torch.int8).cuda()

However, there are some typos in the code that cause errors. For example, in https://github.com/midas-research/man-sf-emnlp/blob/393fcd91b8aeeb7e806e752dc771c27946bb16e0/train.py#L22 it should be model instead of models, and in https://github.com/midas-research/man-sf-emnlp/blob/393fcd91b8aeeb7e806e752dc771c27946bb16e0/train.py#L140 it should be args.lr instead of l_r. I assume the author cleaned the code but did not run it before publishing.

chenqinkai avatar Dec 13 '21 22:12 chenqinkai

Thanks @chenqinkai, were you able to follow the provided links and generate the text/price/label embeddings from real data? I am interested in reproducing the numbers published in the paper.

dotchen avatar Dec 13 '21 22:12 dotchen

Hi, @dotchen

The provided link is simply the link to Google's Universal Sentence Encoder, which is easy to use. I was not trying to reproduce the numbers in the paper; I was applying the method to my own data, but the results I am getting are not great.
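In case it is useful, here is a minimal sketch of getting the 512-dim embeddings with the Universal Sentence Encoder via TensorFlow Hub (the model URL/version and the example tweets are my own choices, not something specified in this repo):

import tensorflow_hub as hub

# Load the Universal Sentence Encoder (v4 here; any variant that outputs
# 512-dim embeddings should work the same way for this purpose).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

tweets = ["$AAPL beats earnings expectations", "$TSLA deliveries miss estimates"]
embeddings = embed(tweets).numpy()  # shape: (len(tweets), 512)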

It is not difficult to get a result: you simply need to construct three matrices (the definition of each axis of each matrix is in my previous post), but it seems difficult to get the exact numbers.

Also, I don't see the validation set being used anywhere in the code, so I am not sure how it factors into training.

chenqinkai avatar Dec 14 '21 10:12 chenqinkai

but it seems difficult to get the exact numbers. Also, I don't see the validation set being used anywhere in the code

So did you not even get good training accuracy? Also, how did you build the graph? The link points to a paper PDF without further instructions.

dotchen avatar Dec 15 '21 01:12 dotchen

@dotchen The training loss was at least converging, and the in-sample accuracy was OK. But it is a training process without validation, and the accuracy on my test set was not good.

I did not understand the graph construction from WikiData either, so I did not bother using the same graph as in the paper. I tried using a correlation matrix of historical returns or GICS sector membership as the graph instead.
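For illustration, a rough sketch of the correlation-based graph, assuming returns is an (n_days_hist, n_stock) array of my own historical daily returns (the 0.5 threshold is arbitrary):

import numpy as np

corr = np.corrcoef(returns.T)              # (n_stock, n_stock) return correlation matrix
adj = (np.abs(corr) > 0.5).astype(float)   # connect strongly correlated pairs; threshold is arbitrary
np.fill_diagonal(adj, 1.0)                 # keep self-connections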

chenqinkai avatar Dec 15 '21 10:12 chenqinkai

@chenqinkai would you mind sharing your code to construct the matrices?

It is not difficult to get a result: you simply need to construct three matrices (the definition of each axis of each matrix is in my previous post), but it seems difficult to get the exact numbers.

jeremytanjianle avatar Dec 21 '21 04:12 jeremytanjianle

The test label only takes two possible values:

test_label = torch.tensor(np.random.choice([0, 1], size=(n_stock, 1)), dtype=torch.int8).cuda()

But the paper states that they label movements > +0.55% as the positive class and < -0.5% as the negative class. So what about the null class, i.e., the observations that fall between -0.5% and +0.55%?

jeremytanjianle avatar Dec 21 '21 07:12 jeremytanjianle

@vinitrinh I am not working on the same data as the paper; I am applying the method to my own data, so my code will not work for you directly.

But it is really not difficult. For example, for Twitter data, you first use the Universal Sentence Encoder to transform each tweet into a 512x1 vector. You then group these vectors by stock and date, so for each stock and each date you have a matrix of shape (n_tweet_per_day, n_tweet_feat); if there are not enough tweets for that day, you pad with zero vectors. You then stack along two more dimensions to form a tensor of size (n_stock, n_day, n_tweet_per_day, n_tweet_feat), as in my random data generation:

test_text = torch.tensor(np.random.normal(size=(n_stock, n_day, n_tweet_per_day, n_tweet_feat)), dtype=torch.float32).cuda()

The same goes for the price data and the label data.
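A rough sketch of that grouping and padding, assuming tweet_df is a pandas DataFrame with stock, date, and embedding columns (the embedding being the 512-dim vector from the encoder), and stocks and dates are the ordered stock list and trading days of the window:

import numpy as np

n_tweet_per_day = 5  # cap on tweets per stock per day; extra tweets are dropped
text_tensor = np.zeros((len(stocks), len(dates), n_tweet_per_day, 512), dtype=np.float32)

for i, stock in enumerate(stocks):
    for j, date in enumerate(dates):
        day = tweet_df[(tweet_df.stock == stock) & (tweet_df.date == date)]
        for k, emb in enumerate(day.embedding.iloc[:n_tweet_per_day]):
            text_tensor[i, j, k] = emb  # days with fewer tweets stay zero-padded

text_tensor then has shape (n_stock, n_day, n_tweet_per_day, n_tweet_feat) and can be converted with torch.tensor(text_tensor).cuda().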

As for the neutral class, I think those samples are simply removed, as described in https://aclanthology.org/P18-1183.pdf, Section 3, paragraph 2.
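Something like this for the labels, with ret being the per-stock movement percentages (thresholds as quoted above; the neutral samples are dropped):

import numpy as np

ret = np.array([0.8, 0.1, -1.2, 0.6, -0.3])          # example movement percentages (%)
label = np.where(ret > 0.55, 1, np.where(ret < -0.5, 0, -1))
keep = label != -1                                     # drop the neutral class entirely
ret, label = ret[keep], label[keep]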

chenqinkai avatar Dec 22 '21 15:12 chenqinkai

@vinitrinh This is from my understanding of the code, which does not necessarily mean it is correct; it would be much better if the author could clarify or share his code.

chenqinkai avatar Dec 22 '21 15:12 chenqinkai

@chenqinkai Could you please explain the construction of the graph more clearly? For example, how do you construct the graph based on GICS sectors? Does it mean that stocks in the same sector have value 1 in the corresponding matrix entry and 0 otherwise?

I did not understand the graph construction from WikiData either, so I did not bother using the same graph as in the paper. I tried using a correlation matrix of historical returns or GICS sector membership as the graph instead.

rloner avatar Dec 22 '21 15:12 rloner

@rloner Yes, and then you normalize it with D^(-1/2).
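Roughly, assuming sectors is a list of GICS sector codes aligned with the stock order (a sketch only; the normalize function in the repo's utils.py may differ in detail):

import numpy as np

sectors = np.asarray(sectors)
adj = (sectors[:, None] == sectors[None, :]).astype(float)  # 1 if same sector, else 0 (diagonal is 1)

# symmetric normalization D^(-1/2) A D^(-1/2)
d_inv_sqrt = np.diag(1.0 / np.sqrt(adj.sum(axis=1)))
adj_norm = d_inv_sqrt @ adj @ d_inv_sqrt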

chenqinkai avatar Dec 22 '21 21:12 chenqinkai

@chenqinkai Sorry, but is it necessary to normalize the graph matrix for the GAT model? It seems to be unnecessary in the paper?

rloner avatar Dec 23 '21 11:12 rloner

@rloner It is normalized here: https://github.com/midas-research/man-sf-emnlp/blob/393fcd91b8aeeb7e806e752dc771c27946bb16e0/utils.py#L23

But it is a small detail; you can try it either way.

chenqinkai avatar Dec 23 '21 11:12 chenqinkai

@chenqinkai Thank you very much! It would be really nice if you could upload your code. I still seem to have some difficulties constructing the graph.

rloner avatar Dec 23 '21 11:12 rloner

Actually, I don't think you can reproduce the numbers in the paper using the author's code. If somebody only releases part of their code, and with bugs, how could you expect to reproduce the results?

TongLiu-github avatar Apr 28 '22 14:04 TongLiu-github