OTMISC: Our Topic Modeling Is Super Cool

An advanced tool for running and evaluating topic modeling algorithms on short and long texts.
Introduction
This project was developed by Computer Science and Mathematics master's students at TUM (Technical University of Munich) for the course "Master's Practical Course - Machine Learning for Natural Language Processing Applications" in SS22 (Summer Semester 2022). Since this project is still in its infancy, use it with care.
Project Advisors:
- PhD Candidate (M.Sc.) Miriam Anschütz
- PhD Candidate (M.Sc.) Ahmed Mosharafa
Project Scope:
- Evaluating different topic modeling algorithms on short- and long-text datasets.
- Drawing observations on the applicability of certain algorithms' clusters to different types of datasets.
- Producing an outcome that includes a metric-based evaluation as well as a human-based evaluation of the algorithms.
Contributors
Contributor | GitHub Account | Email Address | LinkedIn Account | Other Links
---|---|---|---|---
Berk Sudan | github:berksudan | [email protected] | 🔗 | medium.com/@berksudan
Ferdinand Kapl | github:fkapl | [email protected] | - | -
Yuyin Lang | github:YuyinLang | [email protected] | 🔗 | -
Repository structure
- docs: includes documents for this work, such as the task description, final paper, presentations, and literature research.
- data: includes all the datasets used in this work.
- notebooks: includes all the demo notebooks (one per algorithm) and one bulk-run notebook.
- src: includes the Python files that make up the pipeline of this work.
Project Report and Presentations
- Final Project Report: pdf, LaTeX.
- Presentations:
- Final Presentation: pdf, odp, pptx.
- Midterm Presentation: pdf, odp, pptx.
- Intermediate Presentation - #2: pdf, odp, pptx.
- Intermediate Presentation - #1: pdf, odp, pptx.
Datasets
- Explored the provided datasets to uncover their inherent characteristics.
- Obtained an overview of the statistical characteristics of the datasets.
Available Datasets
Resource Name | Is Suitable? | Type | Contains Tweet Text? | Topic Count | Total Instances | Topic Distribution |
---|---|---|---|---|---|---|
20 News (By Date) | Yes | Long Text Dataset | No | 20 | 853627 | (42K - 45K - 52K - 33K - 30K - 53K - 33K - 35K - 33K - 37K - 45K - 51K - 33K - 45K - 45K - 51K - 46K - 65K - 50K - 33K) |
Yahoo Dataset (60K) | Yes | Long Text Dataset | No | 10 | 60000 | (6K - 6K - 6K - 6K - 6K - 6K - 6K - 6K - 6K - 6K) |
AG News Titles and Texts | Yes | Long Text Dataset | No | 4 | 127600 | (32K - 32K - 32K - 32K) |
CRISIS NLP - Resource #01 | Yes | Short Text Dataset | Yes | 4 | 20514 | (3K - 9K - 4K - 5K) |
CRISIS NLP - Resource #12 | Yes | Short Text Dataset | Yes | 4 | 8007 | (2K - 2K - 2K - 2K) |
CRISIS NLP - Resource #07 | Yes | Short Text Dataset | Yes | 2 | 10941 | (5K - 6K) |
CRISIS NLP - Resource #17 | Yes | Short Text Dataset | Yes | 10 | 76484 | (6K - 5K - 3K - 21K - 8K - 7K - 4K - 12K - 0.5K - 9K) |
AG News Titles | Yes | Short Text Dataset | No | 4 | 127600 | (32K - 32K - 32K - 32K) |
- For the datasets that were analyzed but are not available here, see: unavailable_datasets.md.
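For illustration, the 20 Newsgroups corpus with its date-based train/test split can also be fetched through scikit-learn's standard loader. This is only a hedged alternative to the copies shipped under data/ and is not part of this repository's pipeline:

```python
from sklearn.datasets import fetch_20newsgroups

# scikit-learn downloads the "by date" distribution of 20 Newsgroups;
# subset="all" concatenates the date-based train and test splits.
newsgroups = fetch_20newsgroups(subset="all",
                                remove=("headers", "footers", "quotes"))

docs, labels = newsgroups.data, newsgroups.target
print(len(docs), "documents across", len(newsgroups.target_names), "topics")
```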
Deployment and Run
Build
- For Linux, it is enough to run the following command to set up the virtual environment and install the dependencies:
$ ./build_for_linux.sh
- For Windows and other operating systems, install Python 3.8 and install the dependencies with `pip install -r requirements.txt`. Be careful about the package versions and make sure that you have the correct versions in your current setup!
Run
- To run the Jupyter Notebook, just execute the following command:
$ ./run_jupyter.sh
Note: For Windows and other operating systems, this can be done via Anaconda or similar tools.
- Then, you can run the notebooks in `./notebooks`. There is one notebook for each algorithm, plus a general main runner that executes the pipeline parametrically from a config.
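As a rough sketch of what such a parametric run can look like, the config below is hypothetical: the key names and the `run_topic_model` helper are illustrative, not the repository's actual API. Check the bulk-run notebook in `./notebooks` for the real parameters.

```python
# Hypothetical config: all key names and run_topic_model are illustrative,
# not this repository's actual API.
config = {
    "algorithm": "bertopic",       # e.g. bertopic, top2vec, lda, ...
    "dataset": "ag_news_titles",   # one of the datasets under data/
    "preprocessing": True,         # toggle text preprocessing
    "num_topics": 10,
    "top_k_words": 10,             # topic words used by the metrics
}

# A main runner would then dispatch on the config, e.g.:
# topics, assignments = run_topic_model(**config)
```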
Evaluation Metrics
The following evaluation metrics are used for a metric-based assessment of the produced topics (a minimal sketch of how they can be computed follows the list):
- Diversity Unique: percentage of unique topic words; in [0,1], with 1 meaning all topic words are different.
- Diversity Inverted Rank-Biased Overlap: rank-weighted percentage of unique topic words, where words at higher ranks are penalized less; in [0,1], with 1 meaning all topic words are different.
- Coherence Normalized Pointwise Mutual Information (NPMI): measures how well the topic words fit together as a topic; in [-1,1], with 1 for perfect association.
- Coherence V: coherence of topic words evaluated with large sliding windows over the text together with an indirect cosine similarity based on NPMI; in [0,1], with 1 for perfect association.
- Rand Index: similarity between the clustering given by the topic model and the real labels; in [0,1], with 1 for a perfect match.
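A minimal sketch of how these metrics can be computed, assuming `topics` is a list of ranked topic-word lists, using gensim's `CoherenceModel` and scikit-learn's `rand_score`. The truncated RBO implementation here is a simplified stand-in, not necessarily the exact variant used in the pipeline:

```python
from itertools import combinations

from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from sklearn.metrics import rand_score


def diversity_unique(topics, top_k=10):
    """Percentage of unique words across the top-k words of all topics."""
    words = [w for topic in topics for w in topic[:top_k]]
    return len(set(words)) / len(words)


def truncated_rbo(a, b, p=0.9):
    """Simplified truncated rank-biased overlap of two ranked word lists."""
    depth = min(len(a), len(b))
    score = sum(p ** (d - 1) * len(set(a[:d]) & set(b[:d])) / d
                for d in range(1, depth + 1))
    return (1 - p) * score


def diversity_inverted_rbo(topics, p=0.9):
    """1 - mean pairwise RBO; 1 means completely disjoint ranked topics."""
    pairs = list(combinations(topics, 2))
    return 1 - sum(truncated_rbo(a, b, p) for a, b in pairs) / len(pairs)


def evaluate(topics, texts, labels, assignments, top_k=10):
    """texts: tokenized corpus; labels/assignments: true vs. predicted clusters."""
    dictionary = Dictionary(texts)
    npmi = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                          coherence="c_npmi", topn=top_k).get_coherence()
    c_v = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                         coherence="c_v", topn=top_k).get_coherence()
    return {
        "diversity_unique": diversity_unique(topics, top_k),
        "diversity_inverted_rbo": diversity_inverted_rbo(topics),
        "coherence_npmi": npmi,
        "coherence_c_v": c_v,
        "rand_index": rand_score(labels, assignments),
    }
```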
References
- Angelov: Top2Vec: Distributed Representations of Topics: https://github.com/ddangelov/Top2Vec
- Grootendorst: BERTopic: https://github.com/MaartenGr/BERTopic
- OCTIS Framework: https://github.com/MIND-Lab/OCTIS
- Dataset - CRISIS NLP: https://crisisnlp.qcri.org/
- Dataset - 20 Newsgroups: http://qwone.com/~jason/20Newsgroups/
- Dataset - Yahoo: https://github.com/LC-John/Yahoo-Answers-Topic-Classification-Dataset
- CSV to Markdown Table Converter #1: https://tableconvert.com/
- CSV to Markdown Table Converter #2: https://markdown.co/tool/csv-to-markdown-table