tweet2vec
Implementation of the model presented in our ECIR 2017 paper, "Improving Tweet Representations using Temporal and User Context".
This repository contains the Torch implementation of our ECIR 2017 work.
Quick Start
Download the user profile attribute dataset from here
Download the GloVe word vectors trained on a large Twitter corpus.
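As an illustration (not part of this repository's scripts), the 200-dimensional Stanford GloVe Twitter vectors match the default wdim setting; assuming the standard Stanford release URL is still available, a typical download looks like:
wget https://nlp.stanford.edu/data/glove.twitter.27B.zip
unzip glove.twitter.27B.zip -d data/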
To train our model, run:
th main.lua
Dependencies
- Torch
- xlua
- tds
- optim
- nnx
- cutorch
- cunn
- cunnx
All packages except Torch can be installed using:
luarocks install <package-name>
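For instance, assuming luarocks is already on your path and a CUDA toolkit is installed (cutorch, cunn and cunnx require it), the remaining packages can be installed in one go:
for pkg in xlua tds optim nnx cutorch cunn cunnx; do luarocks install $pkg; done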
Options
th main.lua accepts the following options:
- data_dir: directory for accessing the user profile prediction data for an attribute (spouse, education or job) [data/spouse/]
- glove_dir: directory for accessing the pre-trained GloVe word embeddings [data/]
- pred_dir: directory for storing the output (i.e., word, tweet and user embeddings) [predictions/]
- to_lower: should we convert words to lower case? [1=yes (default), 0=no]
- wdim: dimensionality of word embeddings [200]
- wwin: size of the context window for the word context model; add 1 for the target word [21]
- twin: size of the context window for the tweet context model; add 1 for the target tweet [21]
- min_freq: words that occur fewer than min_freq times will not be used for training [5]
- pad_tweet: should we pad the tweet? [1=yes (default), 0=no]
- is_word_center_target: should we model the center word as the target? if 0, the last word is used as the target [0]
- is_tweet_center_target: should we model the center tweet as the target? if 0, the last tweet is used as the target [1]
- pre_train: should we initialize word embeddings with pre-trained vectors? [1=yes (default), 0=no]
- wc_mode: how to compute the hidden representation for the word context model [1=concatenation, 2=sum (default), 3=average, 4=attention-based average of the context embeddings]
- tc_mode: how to compute the hidden representation for the tweet context model [1=concatenation, 2=sum, 3=average, 4=attention-based average (default) of the context embeddings]
- tweet: should we also use the tweet-based model? [1=yes (default), 0=no]
- user: should we also use the user-based model? [1=yes, 0=no (default)]
- wpred: which softmax to use for the final prediction in the word context model [1=normal (time-consuming for large datasets), 2=hierarchical (default), 3=Brown softmax]
- tpred: which softmax to use for the final prediction in the tweet context model [1=normal (time-consuming for large datasets), 2=hierarchical (default), 3=Brown softmax]
- learning_rate: learning rate for the gradient descent algorithm [0.001]
- batch_size: number of sequences to train on in parallel [128]
- max_epochs: number of full passes through the training data [25]
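For example, a training run on the spouse attribute that spells out the defaults listed above might look like this (the flag values shown are illustrative, not prescriptive):
th main.lua -data_dir data/spouse/ -glove_dir data/ -pred_dir predictions/ -wdim 200 -batch_size 128 -max_epochs 25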
Author
Licence
MIT