Multiterm-Topic-Model
Multiterm-Topic-Model copied to clipboard
Code for Short Text Topic Modeling with Flexible Word Patterns (IJCNN2019)
Code for Short Text Topic Modeling with Flexible Word Patterns
Usage
1. prepare multiterms
python preprocess_multiTerm.py --data_path {data_path} --output_dir {output_dir}
data_path is the path of short texts in the form of
word1 word2...
...
such as
python preprocess_multiTerm.py --data_path data/stackoverflow --output_dir output/stackoverflow
Please NOTE that the words in each short text should keep the original order.
After running, these files will be outputted in output_dir:
- multiTerms
word_id of each multiterm.
word_id word_id...
...
- multiTerms_list
word_id of each distinct multiterm.
word_id word_id...
...
- transformed_multiTerm_texts
Multierms of each text. Each multiterm is made up of word ids.
word_id word_id, word_id word_id, ...
...
- word_index.txt
word_id word
...
- mit_id_text
mit_id mit_id mit_id ...
2. run MTM
java MTM.MultiTermModel {topic_num} {input_path} {output_path} {alpha} {beta} {iteration times}
input_path is the path including the output files of preprocess_multiterm.py.
such as
java MTM.MultiTermModel 20 output/stackoverflow/ output/stackoverflow/topic_20/ 2.0 0.08 500
The following files will be outputted in output_path:
-
top_topics: word ids sorted by p(w|z).
-
top_topics_words: words sorted by p(w|z).
-
pz_d: topic distributions of each text.
Then you can evaluate the topic words with the coherence score.
An example of coherence score output log can be found in output/stackoverflow/stackoverflow_K20.
Citation
If you want to use our code, please cite as
@inproceedings{Wu2019,
author = {Wu, Xiaobao and Li, Chunping},
booktitle = {International Joint Conference on Neural Networks},
title = {{Short Text Topic Modeling with Flexible Word Patterns}},
year = {2019}
}