Indonesian-Twitter-Emotion-Dataset
Indonesian Twitter dataset for the emotion classification task
This dataset contains 4,403 Indonesian tweets, each labeled with one of five emotion classes: love, anger, sadness, joy, and fear.
Data Format
Each line consists of a tweet and its emotion label, separated by a comma (,). The first line is a header. A tweet that contains a comma (,) in its text is enclosed in double quotes (" ") so that the comma is not read as a column separator. The tweets in this dataset have been pre-processed according to the following criteria:
- Username mentions (@) have been replaced with the term [USERNAME]
- URLs/hyperlinks (http://... or https://...) have been replaced with the term [URL]
- Sensitive numbers, such as phone numbers, invoice numbers, and courier tracking numbers, have been replaced with the term [SENSITIVE-NO]
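The format and masking rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the header names `label` and `tweet`, the sample rows, and the regular expressions (in particular the "8 or more digits" rule for sensitive numbers) are hypothetical and are not taken from the authors' pre-processing code. Check the header of the actual file before relying on the column names.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample in the described format. The header names
# "label" and "tweet" are assumptions; check the real file's header.
SAMPLE = '''label,tweet
anger,"[USERNAME] paketnya belum sampai, nomor resi [SENSITIVE-NO]"
joy,Akhirnya libur juga [URL]
'''


def read_rows(handle):
    """Yield (tweet, label) pairs from a file-like object.

    The csv module treats a double-quoted field as one column,
    so commas inside a quoted tweet do not split it.
    """
    for row in csv.DictReader(handle):
        yield row["tweet"], row["label"]


# Illustrative patterns only; the authors' exact rules are not documented here.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Assumed heuristic: a standalone run of 8 or more digits is "sensitive".
SENSITIVE_RE = re.compile(r"\b\d{8,}\b")


def mask_tweet(text):
    """Replace URLs, mentions, and long digit runs with placeholder terms."""
    text = URL_RE.sub("[URL]", text)  # URLs first, so "@" inside a URL is untouched
    text = MENTION_RE.sub("[USERNAME]", text)
    text = SENSITIVE_RE.sub("[SENSITIVE-NO]", text)
    return text


rows = list(read_rows(io.StringIO(SAMPLE)))
label_counts = Counter(label for _, label in rows)
```

To read the real file, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved) to `read_rows` instead of the in-memory sample. `mask_tweet` is useful mainly for bringing new, raw tweets into the same shape as the dataset before classification.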
Pre-trained Word Embedding
We have trained Word2Vec and FastText vectors on 1 million Indonesian tweets. These pre-trained word embeddings can be downloaded here.
Citation
If you publish a paper that uses this dataset or the pre-trained word embeddings, please cite the following publication:
Mei Silviana Saputri, Rahmad Mahendra, and Mirna Adriani, "Emotion Classification on Indonesian Twitter Dataset", in Proceedings of the International Conference on Asian Language Processing 2018, 2018.
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.