DAZER
The Tensorflow implementation of accepted ACL 2018 paper "A deep relevance model for zero-shot document filtering", Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen, http://aclweb.org/anthology/...
The TensorFlow implementation of our ACL 2018 paper:
A Deep Relevance Model for Zero-Shot Document Filtering, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen
Paper url: http://aclweb.org/anthology/P18-1214
Requirements
- Python 3.5
- Tensorflow 1.2
- Numpy
- Traitlets
Guide To Use
Prepare your dataset: first, prepare your own data. See Data Preparation.
Configure: then, configure the model through the config file. Configurable parameters are listed under Configurations.
See the example: sample.config
In addition, you need to change the zero-shot label settings in get_label.py.
(Make sure get_label.py and model.py are in the same directory.)
Training: pass the config file, training data, and validation data as
python model.py config-file\
--train \
--train_file: path to training data\
--validation_file: path to validation data\
--checkpoint_dir: directory to store/load model checkpoints\
--load_model: True or False, depending on whether a checkpoint already exists (start with a new model or continue training)
See example: sample-train.sh
Testing: pass the config file and testing data as
python model.py config-file\
--test \
--test_file: path to testing data\
--test_size: size of testing data (number of testing samples)\
--checkpoint_dir: directory to load trained model\
--output_score_file: file to output document scores
Relevance scores will be output to output_score_file, one score per line, in the same order as test_file.
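Because scores are written in the same order as test_file, they can be joined back to the test samples by line position. The following is a minimal sketch, assuming each line of the score file holds a single float; the function name `rank_by_score` is hypothetical and not part of this repository.

```python
def rank_by_score(test_lines, score_lines):
    """Pair test samples with scores by line position, highest score first.

    Both arguments are iterables of text lines, e.g. open file objects.
    Assumes each score line holds a single float (an assumption based on
    the "one score per line" description).
    """
    samples = [line.rstrip("\n") for line in test_lines if line.strip()]
    scores = [float(line) for line in score_lines if line.strip()]
    if len(samples) != len(scores):
        raise ValueError("number of samples and scores differ")
    paired = [(index, sample, score)
              for index, (sample, score) in enumerate(zip(samples, scores))]
    # Sort by score, descending; ties keep their original file order.
    return sorted(paired, key=lambda row: row[2], reverse=True)
```

With real files, pass the two open file objects, for example `rank_by_score(open(test_path), open(score_path))`, where both paths are whatever you gave to --test_file and --output_score_file.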
Data Preparation
All seed words and documents must be mapped into sequences of integer term ids. Term ids start at 1.
Training Data Format
Each training sample is a tuple of (seed words, positive document, negative document)
seed_words \t positive_document \t negative_document
Example: 334,453,768 \t 123,435,657,878,6,556 \t 443,554,534,3,67,8,12,2,7,9
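To make the training format concrete, here is a small sketch that parses and writes one training line. It assumes the three fields are separated by a tab character and that ids within a field are comma-separated, as in the example above; the helper names are hypothetical and do not come from model.py.

```python
def parse_id_list(field):
    """Turn '334,453,768' into [334, 453, 768]."""
    return [int(token) for token in field.strip().split(",") if token.strip()]


def parse_training_line(line):
    """Split one line into (seed_words, positive_doc, negative_doc) id lists."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 tab-separated fields, got %d" % len(fields))
    return tuple(parse_id_list(field) for field in fields)


def format_training_line(seed_words, positive_doc, negative_doc):
    """Inverse of parse_training_line: join id lists back into one line."""
    return "\t".join(
        ",".join(str(term_id) for term_id in ids)
        for ids in (seed_words, positive_doc, negative_doc)
    )
```

Running every line of a training or validation file through `parse_training_line` before training is a cheap way to catch malformed rows early.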
Testing Data Format
Each testing sample is a tuple of (seed words, document)
seed_words \t document
Example: 334,453,768 \t 123,435,657,878,6,556
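A testing file typically pairs the same seed words with many candidate documents. The sketch below builds such lines and counts them, since the count is what --test_size expects. The function name and the second sample document are illustrative assumptions.

```python
def build_test_lines(seed_word_ids, documents):
    """Return one 'seed_words<TAB>document' line per candidate document."""
    seed_field = ",".join(str(term_id) for term_id in seed_word_ids)
    lines = []
    for doc_ids in documents:
        doc_field = ",".join(str(term_id) for term_id in doc_ids)
        lines.append(seed_field + "\t" + doc_field)
    return lines


# Hypothetical data: one set of seed words scored against two documents.
lines = build_test_lines(
    [334, 453, 768],
    [[123, 435, 657, 878, 6, 556], [443, 554, 534]],
)
test_size = len(lines)  # value to pass as --test_size
```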
Validation Data Format
The format is the same as the training data format.
Label Dict File Format
Each line is a tuple of (label_name, seed_words)
label_name/seed_words
Example: alt.atheism/atheist christian atheism god islamic
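A label dict file in this format can be read with a few lines of Python. This is a sketch, assuming the label name ends at the first `/` and the seed words are separated by spaces, as in the example; `parse_label_dict` is a hypothetical helper, not the loader used by the model.

```python
def parse_label_dict(lines):
    """Map each label name to its list of seed words."""
    label_to_seeds = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first '/', so seed words are everything after it.
        label, _, seeds = line.partition("/")
        label_to_seeds[label] = seeds.split()
    return label_to_seeds
```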
Word2id File Format
Each line is a tuple of (word, id)
word id
Example: world 123
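The word2id file is what turns raw text into the id sequences used in the training and testing files. The sketch below loads it and encodes a list of words; dropping out-of-vocabulary words is an assumption made for illustration, and both function names are hypothetical.

```python
def load_word2id(lines):
    """Read 'word id' lines into a dict."""
    word2id = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word2id[parts[0]] = int(parts[1])
    return word2id


def encode(words, word2id):
    """Map words to a comma-separated id string, skipping unknown words."""
    return ",".join(str(word2id[word]) for word in words if word in word2id)
```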
Embedding File Format
Each line is a tuple of (id, embedding)
id embedding
Example: 1 0.3 0.4 0.5 0.6 -0.4 -0.2
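For inspection or sanity checks, the embedding file can be loaded into a NumPy matrix indexed by term id. This is a sketch only: leaving row 0 as zeros (because ids start at 1) and sizing the matrix by the largest id are assumptions, and `load_embeddings` is not the loader used by model.py.

```python
import numpy as np


def load_embeddings(lines, embedding_size):
    """Build a matrix whose row i holds the vector for term id i."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        term_id = int(parts[0])
        values = [float(x) for x in parts[1:]]
        if len(values) != embedding_size:
            raise ValueError(
                "term %d has %d values, expected %d"
                % (term_id, len(values), embedding_size)
            )
        vectors[term_id] = values
    if not vectors:
        raise ValueError("no embeddings found")
    # Row 0 stays all-zero: ids start at 1, so index 0 is unused here.
    matrix = np.zeros((max(vectors) + 1, embedding_size), dtype=np.float32)
    for term_id, values in vectors.items():
        matrix[term_id] = values
    return matrix
```

The length check ties the file to the embedding_size setting, so a mismatch between the two shows up before training starts.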
Configurations
Model Configurations
- BaseNN.embedding_size: word embedding dimension
- BaseNN.max_q_len: max query length
- BaseNN.max_d_len: max document length
- DataGenerator.max_q_len: max query length. Should be the same as BaseNN.max_q_len
- DataGenerator.max_d_len: max document length. Should be the same as BaseNN.max_d_len
- BaseNN.vocabulary_size: vocabulary size
- DataGenerator.vocabulary_size: vocabulary size
- BaseNN.batch_size: batch size
- BaseNN.max_epochs: max number of epochs to train
- BaseNN.eval_frequency: evaluate the model on the validation set every this many epochs
- BaseNN.checkpoint_steps: save the model every this many epochs
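Several of these settings appear under both BaseNN and DataGenerator and need to agree. The sketch below checks that on a plain dictionary of values. The dictionary, the numbers in it, and the function name are hypothetical; this is not how model.py reads its config file. vocabulary_size is included on the assumption that it should match as well.

```python
def check_shared_settings(settings):
    """Return names of settings whose BaseNN and DataGenerator values differ."""
    shared = ("max_q_len", "max_d_len", "vocabulary_size")
    mismatched = []
    for name in shared:
        if settings.get("BaseNN." + name) != settings.get("DataGenerator." + name):
            mismatched.append(name)
    return mismatched


# Hypothetical values for illustration only.
example = {
    "BaseNN.max_q_len": 5,
    "DataGenerator.max_q_len": 5,
    "BaseNN.max_d_len": 300,
    "DataGenerator.max_d_len": 200,  # deliberately inconsistent
    "BaseNN.vocabulary_size": 50000,
    "DataGenerator.vocabulary_size": 50000,
}
problems = check_shared_settings(example)
```

Here `problems` lists `max_d_len`, the one setting that differs between the two groups.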
Data
- DAZER.emb_in: path of initial embeddings file
- DAZER.label_dict_path: path of label dict file
- DAZER.word2id_path: path of word2id file
Training Parameters
- DAZER.epsilon: epsilon for the Adam optimizer
- DAZER.embedding_size: word embedding dimension
- DAZER.vocabulary_size: vocabulary size of the dataset
- DAZER.kernal_width: width of the kernel
- DAZER.kernal_num: number of kernels
- DAZER.regular_term: weight of the L2 loss
- DAZER.maxpooling_num: number of values kept by K-max pooling
- DAZER.decoder_mlp1_num: number of hidden units of the first MLP in the relevance aggregation part
- DAZER.decoder_mlp2_num: number of hidden units of the second MLP in the relevance aggregation part
- DAZER.model_learning_rate: learning rate for the model (as opposed to the adversarial classifier)
- DAZER.adv_learning_rate: learning rate for the adversarial classifier
- DAZER.train_class_num: number of classes at training time
- DAZER.adv_term: weight of the adversarial loss when updating the model's parameters
- DAZER.zsl_num: number of zero-shot labels
- DAZER.zsl_type: type of zero-shot label setting (you may have multiple zero-shot settings for the same number of zero-shot labels; this indicates which one you pick for the experiment, see get_label.py for more details)