NLP Bahasa Indonesia Resources

This repository provides links to useful datasets and other resources for NLP in Bahasa Indonesia.

Last Update: 15 Mar 2022

Corpus
- Named Entity Recognition
- POS-Tagging
- Question and Answering
- Paraphrasing
- Text Summarization
- Hate-speech
- Word Analogy
- Formal-Informal
- Multilingual Parallel
- Unsupervised Corpus
- Voice-Text
- Puisi and Pantun
Dictionary
- Synonym
- Sentiment
- Position or Degree
- Root Words
- Slang Words
- Stop Words
- Swear Words
- Composite Words
- Number Words
- Calendar Words
- Emoticon
- Acronym
- Indonesia Region
- Country
- Region
- Title of Name
- Gender by Name
- Organization
Articles and Papers
- POS-Tagging
- Word Embedding
- Topic Analysis
- Text Classification
Pre-trained Models
Usable Library
Spelling Correction
Twitter Scraping
Other Resources

Corpus

Named Entity Recognition

Product NER. https://github.com/dziem/proner-labeled-text
NER-grit. https://github.com/grit-id/nergrit-corpus

POS-Tagging

IDN Tagged Corpus. https://github.com/famrashel/idn-tagged-corpus
Indonesian Part-of-Speech (POS) Tagging. https://github.com/kmkurn/id-pos-tagging/blob/master/data/dataset.tar.gz

Question and Answering

TydiQA. https://github.com/google-research-datasets/tydiqa

Paraphrasing

Quora Paraphrasing. https://github.com/louisowen6/quora_paraphrasing_id
Paraphrase Adversaries from Word Scrambling. https://github.com/Wikidepia/indonesian_datasets/tree/master/paraphrase/paws

Text Summarization

Indosum. https://github.com/kata-ai/indosum
Liputan6. https://huggingface.co/datasets/id_liputan6

Hate-speech

ID Multi Label Hate Speech. https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection

Word Analogy

KAWAT. https://github.com/kata-ai/kawat

Formal-Informal

STIF-Indonesia. https://github.com/haryoa/stif-indonesia
IndoCollex. https://github.com/haryoa/indo-collex
https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection/blob/master/new_kamusalay.csv

Multilingual Parallel

https://huggingface.co/datasets/alt
https://opus.nlpl.eu/bible-uedin.php
http://www.statmt.org/cc-aligned/
https://huggingface.co/datasets/id_panl_bppt
https://huggingface.co/datasets/open_subtitles
https://huggingface.co/datasets/opus100
https://huggingface.co/datasets/tapaco
https://huggingface.co/datasets/wiki_lingua

Unsupervised Corpus

OSCAR. https://oscar-corpus.com/
Online Newspaper. https://github.com/feryandi/Dataset-Artikel
IndoNLU. https://huggingface.co/datasets/indonlu
IndoNLG. https://github.com/indobenchmark/indonlg
IndoNLI. https://github.com/ir-nlp-csui/indonli
IndoBERTweet. https://github.com/indolem/IndoBERTweet
http://data.statmt.org/cc-100/
https://huggingface.co/datasets/id_clickbait
https://huggingface.co/datasets/id_newspapers_2018
https://opus.nlpl.eu/QED.php

Voice-Text

https://huggingface.co/datasets/common_voice
https://huggingface.co/datasets/covost2

Puisi and Pantun

https://github.com/ilhamfp/puisi-pantun-generator

Dictionary

Synonym

https://github.com/victoriasovereigne/tesaurus

Sentiment

(Negative) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/negatif_ta2.txt
(Negative) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/negative_add.txt
(Negative) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/negative_keyword.txt
(Negative) https://github.com/masdevid/ID-OpinionWords/blob/master/negative.txt
(Positive) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/positif_ta2.txt
(Positive) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/positive_add.txt
(Positive) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/positive_keyword.txt
(Positive) https://github.com/masdevid/ID-OpinionWords/blob/master/positive.txt
(Score) https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/sentimentword.txt
(InSet Lexicon) https://github.com/fajri91/InSet [Paper]
(Twitter Labelled Sentiment) https://www.researchgate.net/profile/Ridi_Ferdiana/publication/339936724_Indonesian_Sentiment_Twitter_Dataset/data/5e6d64c6a6fdccf994ca18aa/Indonesian-Sentiment-Twitter-Dataset.zip?origin=publicationDetail_linkedData [Paper]
https://huggingface.co/datasets/senti_lex
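
The positive and negative word lists above can be used for simple lexicon-based scoring. The following is a minimal sketch, assuming the lists have already been loaded into Python sets; the function name `lexicon_score` and the sample words are illustrative and are not taken from any of the linked files.

```python
def lexicon_score(text, positive, negative):
    """Return (number of positive hits) minus (number of negative hits)."""
    score = 0
    for token in text.lower().split():
        if token in positive:
            score += 1
        elif token in negative:
            score -= 1
    return score


# Hypothetical sample entries standing in for the downloaded lexicons.
positive = {"bagus", "senang", "suka"}
negative = {"buruk", "jelek", "kecewa"}

review = "Pelayanan bagus tapi makanan jelek dan saya kecewa"
print(lexicon_score(review, positive, negative))
```

A score above zero suggests positive sentiment and a score below zero suggests negative sentiment. Whitespace tokenization is a simplification; real text would usually need punctuation handling and slang normalization first.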

Position or Degree

https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/psuf.txt
https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/lldr.txt
https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/opos.txt
https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/ptit.txt

Root Words

https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/rootword.txt
https://github.com/sastrawi/sastrawi/blob/master/data/kata-dasar.original.txt
https://github.com/sastrawi/sastrawi/blob/master/data/kata-dasar.txt
https://github.com/prasastoadi/serangkai/blob/master/serangkai/kamus/data/kamus-kata-dasar.csv

I have combined the root words from all of the repositories above into a single list.
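
Merging several word lists like these can be sketched as follows. This is a minimal example, assuming each source has already been downloaded and read into a list of lines with one word per line (the CSV source would first need its word column extracted). The function name `combine_word_lists` and the sample words are illustrative only.

```python
def combine_word_lists(*word_lists):
    """Merge several word lists into one sorted, de-duplicated list."""
    combined = set()
    for words in word_lists:
        for word in words:
            cleaned = word.strip().lower()
            if cleaned:  # skip blank lines
                combined.add(cleaned)
    return sorted(combined)


# Hypothetical sample data standing in for the downloaded files.
list_a = ["makan\n", "minum\n", "tidur\n"]
list_b = ["Makan", "jalan", ""]

root_words = combine_word_lists(list_a, list_b)
print(root_words)
```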

Slang Words

https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/kbba.txt
https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/slangword.txt
https://github.com/panggi/pujangga/blob/master/resource/formalization/formalizationDict.txt

I have combined the slang words from all of the repositories above into a single dictionary.
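
A slang dictionary is typically used to normalize informal text before further processing. The sketch below assumes each source has already been parsed into a Python dict mapping a slang word to its formal form (the file formats differ between repositories, so parsing is not shown). The names `merge_slang_dicts` and `normalize` and the sample entries are illustrative only.

```python
def merge_slang_dicts(*dicts):
    """Merge slang -> formal mappings; the first source to define a word wins."""
    merged = {}
    for mapping in dicts:
        for slang, formal in mapping.items():
            merged.setdefault(slang.strip().lower(), formal.strip().lower())
    return merged


def normalize(text, slang_dict):
    """Replace each slang token with its formal form; keep other tokens as-is."""
    tokens = text.lower().split()
    return " ".join(slang_dict.get(token, token) for token in tokens)


# Hypothetical entries standing in for the parsed dictionary files.
source_a = {"gak": "tidak", "bgt": "banget"}
source_b = {"gak": "enggak", "yg": "yang"}

slang_dict = merge_slang_dicts(source_a, source_b)
print(normalize("Film yg ini gak bagus bgt", slang_dict))
```

Letting the first source win on conflicts is one possible policy; ordering the sources by how much they are trusted keeps that choice explicit.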

Stop Words

https://github.com/yasirutomo/python-sentianalysis-id/blob/master/data/feature_list/stopwordsID.txt
https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/stopword.txt
https://github.com/abhimantramb/elang/tree/master/word2vec/utils/stopwords-list

I have combined the stop words from all of the repositories above into a single list.
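
Once a stop word list is loaded, filtering a sentence is straightforward. This is a minimal sketch, assuming the list has been read into a Python set; the function name `remove_stop_words` and the sample words are illustrative only.

```python
def remove_stop_words(text, stop_words):
    """Return the lowercase tokens of `text` that are not stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


# Hypothetical sample; a real set would be loaded from a stop word file.
stop_words = {"yang", "di", "dan", "itu"}

tokens = remove_stop_words("Buku yang ada di meja itu", stop_words)
print(tokens)
```

Using a set rather than a list keeps each membership check constant-time, which matters when filtering a large corpus.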

Swear Words

https://github.com/abhimantramb/elang/blob/master/word2vec/utils/swear-words.txt

Composite Words

https://github.com/panggi/pujangga/blob/master/resource/tokenizer/compositewords.txt

Number Words

https://github.com/panggi/pujangga/blob/master/resource/netagger/morphologicalfeature/number.txt

Calendar Words

https://github.com/onlyphantom/elang/blob/master/build/lib/elang/word2vec/utils/negative/calendar-words.txt

Emoticon

https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/emoticon.txt
https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-id.txt
https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/emoticon.txt

Acronym

https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/acronym.txt
https://github.com/panggi/pujangga/blob/master/resource/sentencedetector/acronym.txt
https://id.wiktionary.org/wiki/Lampiran:Daftar_singkatan_dan_akronim_dalam_bahasa_Indonesia#A

Indonesia Region

https://github.com/abhimantramb/elang/blob/master/word2vec/utils/indonesian-region.txt
https://github.com/edwardsamuel/Wilayah-Administratif-Indonesia/tree/master/csv
https://github.com/pentagonal/Indonesia-Postal-Code/tree/master/Csv

Country

https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/country.txt

Region

https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/lpre.txt

Title of Name

https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/ppre.txt

Gender by Name

https://github.com/seuriously/genderprediction/blob/master/namatraining.txt

Organization

https://github.com/panggi/pujangga/blob/master/resource/reference/opre.txt

Articles and Papers

POS-Tagging

https://medium.com/@puspitakaban/pos-tagging-bahasa-indonesia-dengan-flair-nlp-c12e45542860
Manually Tagged Indonesian Corpus [Paper] [GitHub]

Word Embedding

(FastText). https://structilmy.com/2019/08/membuat-model-word-embedding-fasttext-bahasa-indonesia/
(Word2Vec). https://yudiwbs.wordpress.com/2018/03/31/word2vec-wikipedia-bahasa-indonesia-dengan-python-gensim/

Topic Analysis

(Introduction to LSA & LDA). https://monkeylearn.com/blog/introduction-to-topic-modeling/
(Introduction to LDA w/ Code & Tips). https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
(Topic Modeling Methods Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
(Original LDA Paper). http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
(LDA Python Library). https://pypi.org/project/lda/; https://radimrehurek.com/gensim/models/ldamodel.html; https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
(Original CTM Paper). http://people.ee.duke.edu/~lcarin/Blei2005CTM.pdf
(CTM Python Library). https://pypi.org/project/tomotopy/; https://github.com/kzhai/PyCTM
(Gaussian LDA Paper). https://www.aclweb.org/anthology/P15-1077.pdf
(Gaussian LDA Library). https://github.com/rajarshd/Gaussian_LDA
(Temporal Topic Modeling Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
(TOT: A Non-Markov Continuous-Time Model of Topical Trends Paper). https://people.cs.umass.edu/~mccallum/papers/tot-kdd06s.pdf
(TOT Library). https://github.com/ahmaurya/topics_over_time
(Example of LDA in Bahasa Project Code). https://github.com/kirralabs/text-clustering

Text Classification

Zero-shot Learning

(Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach) https://arxiv.org/pdf/1909.00161.pdf | https://github.com/yinwenpeng/BenchmarkingZeroShot
(Integrating Semantic Knowledge to Tackle Zero-shot Text Classification) https://arxiv.org/abs/1903.12626 | https://github.com/JingqingZ/KG4ZeroShotText
(Train Once, Test Anywhere: Zero-Shot Learning for Text Classification) https://arxiv.org/abs/1712.05972 | https://amitness.com/2020/05/zero-shot-text-classification/
(Zero-shot Text Classification With Generative Language Models) https://arxiv.org/abs/1912.10165 | https://amitness.com/2020/06/zero-shot-classification-via-generation/
(Zero-shot User Intent Detection via Capsule Neural Networks) https://arxiv.org/abs/1809.00385 | https://github.com/congyingxia/ZeroShotCapsule

Few-shot Learning

(Few-shot Text Classification with Distributional Signatures) https://arxiv.org/pdf/1908.06039.pdf | https://github.com/YujiaBao/Distributional-Signatures
(Few Shot Text Classification with a Human in the Loop) https://katbailey.github.io/talks/Few-shot%20text%20classification.pdf | https://github.com/katbailey/few-shot-text-classification
(Induction Networks for Few-Shot Text Classification) https://arxiv.org/pdf/1902.10482v2.pdf | https://github.com/zhongyuchen/few-shot-learning

Pre-trained Models

Indo-BERT. https://github.com/indobenchmark/indonlu & https://huggingface.co/indobenchmark/indobert-base-p1
Indo-BERTweet. https://github.com/indolem/IndoBERTweet & https://huggingface.co/indolem/indobertweet-base-uncased
Transformer-based Pre-trained Model in Bahasa. https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers
Generate word embeddings or sentence embeddings using a pre-trained multilingual BERT model. (https://colab.research.google.com/drive/1yFphU6PW9Uo6lmDly_ud9a6c4RCYlwdX#scrollTo=Zn0n2S-FWZih). P.S.: just change the model to 'bert-base-multilingual-uncased'.
https://github.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset. [Paper]
https://github.com/Kyubyong/wordvectors
https://drive.google.com/uc?id=0B5YTktu2dOKKNUY1OWJORlZTcUU&export=download
https://github.com/deryrahman/word2vec-bahasa-indonesia
https://sites.google.com/site/rmyeid/projects/polyglot

Usable Library

Pujangga: Indonesian Natural Language Processing REST API. https://github.com/panggi/pujangga
Sastrawi Stemmer Bahasa Indonesia. https://github.com/sastrawi/sastrawi
NLP-ID. https://github.com/kumparan/nlp-id
MorphInd: Indonesian Morphological Analyzer. http://septinalarasati.com/morphind/
INDRA: Indonesian Resource Grammar. https://github.com/davidmoeljadi/INDRA
Typo Checker. https://github.com/mamat-rahmat/checker_id
Multilingual NLP Package. https://github.com/flairNLP/flair
spaCy [GitHub] [Tutorial]
https://github.com/yohanesgultom/nlp-experiments
https://github.com/yasirutomo/python-sentianalysis-id
https://github.com/riochr17/Analisis-Sentimen-ID
https://github.com/yusufsyaifudin/indonesia-ner

Spelling Correction

You can adapt this code to a Bahasa Indonesia corpus to perform spelling correction.
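
As an illustration of what a corpus-driven corrector can look like, here is a minimal sketch of one common approach: generate every candidate within one edit of the input word and pick the candidate that is most frequent in the corpus. It is not necessarily the approach used by the code referred to above, and the tiny corpus and the names `train`, `edits1`, and `correct` are assumptions made for the example.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def train(corpus_text):
    """Count how often each word occurs in the corpus."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))


def edits1(word):
    """Return all strings that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, or `word` itself."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.get)


# Hypothetical toy corpus; a real corpus would be far larger.
corpus = "saya makan nasi dan saya minum air lalu saya tidur"
counts = train(corpus)

print(correct("mkan", counts))
print(correct("tidru", counts))
```

The quality of the corrections depends almost entirely on the corpus: a larger Bahasa Indonesia corpus gives both wider vocabulary coverage and more reliable frequency counts.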

Twitter Scraping

GetOldTweets3. https://github.com/Mottl/GetOldTweets3

Usage:

import GetOldTweets3 as got

# Build the search criteria: hashtag, date range, location, and language.
tweetCriteria = (
	got.manager.TweetCriteria()
	.setQuerySearch('#CoronaVirusIndonesia')
	.setSince("2020-01-01")
	.setUntil("2020-03-05")
	.setNear("Jakarta, Indonesia")
	.setLang("id")
)
tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# Print the main fields of each retrieved tweet.
for tweet in tweets:
	print(tweet.username)
	print(tweet.text)
	print(tweet.date)
	print(tweet.to)
	print(tweet.retweets)
	print(tweet.favorites)
	print(tweet.mentions)
	print(tweet.hashtags)
	print(tweet.geo)

Tweepy. http://docs.tweepy.org/en/latest/

Step-by-step guide to using Tweepy. https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1

Full list of Tweet object fields. https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

Increasing Tweepy’s standard API search limit. https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./

Other Resources

https://github.com/indonesian-nlp/nlp-resources
https://github.com/irfnrdh/Awesome-Indonesia-NLP
https://github.com/kirralabs/indonesian-NLP-resources
https://huggingface.co/datasets?filter=languages%3Aid&p=0
