Open Korean Corpora: A Living Document for Korean NLP Dataset Curation

Overview

Korean, a language with 80M users, is often overlooked in NLP research
The limited availability of public datasets and tasks has hindered investigation
Even the publicly available datasets are not always accompanied by English documentation, and they are hard to discover
Our work attempts to tackle this problem by curating a living document of open resources for the Korean language

NLP-OSS @ EMNLP 2020

We will be in the live session and monitoring the slide chat during EMNLP 2020. If you have any questions or would simply like to say hello, please drop by!

Public Institutions

Multiple government-funded institutions create datasets for the Korean language

National Institute of Korean Language (NIKL)
Electronics and Telecommunications Research Institute (ETRI)
NIA AI HUB

Generally, government-funded datasets tend to be very restrictive about granting access to non-Korean citizens

Thus, Open Corpora here denotes datasets that are freely accessible and downloadable (requiring at most a simple sign-in)

Open Dataset for Korean NLP

Our work focuses on curating open Korean corpora under the following criteria:

Documentation status
License for use and distribution

Documentation and License

For documentation status (Docu.), the following labels are used:

doc - Does the corpus have any documentation on the usage?
art - Does the corpus have a related article?
inter - Does the corpus have an internationally available publication?

License

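For License, we check the following two points, which the table joins with a slash (for example, com/rd):

Usage: commercially available (com), academic use only (acad), unknown (unk)
Redistribution: available with modification (rd), available without modification only (rd/mod-x, written rd (mod-x) in the table), not available (no), unknown (unk)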
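
As a minimal sketch of how these codes could be read programmatically, the Python snippet below splits a License cell such as com/rd (mod-x) into its usage part, its redistribution part, and a flag for the mod-x marker. The names License and parse_license are hypothetical and not part of this repository, and the snippet assumes the codes are spelled as they appear in the table.

```python
from typing import NamedTuple


class License(NamedTuple):
    usage: str             # "com", "acad", or "unk"
    redistribution: str    # "rd", "no", or "unk"
    no_modification: bool  # True when the cell carries the mod-x marker


def parse_license(cell: str) -> License:
    """Split a license cell such as 'com/rd (mod-x)' (hypothetical helper)."""
    text = cell.strip().lower()
    # mod-x marks redistribution without modification.
    no_modification = "mod-x" in text
    # Remove the marker so that only 'usage/redistribution' remains.
    core = text.replace("(mod-x)", "").replace("mod-x", "")
    parts = [p.strip() for p in core.split("/") if p.strip()]
    usage = parts[0] if parts else "unk"
    redistribution = parts[1] if len(parts) > 1 else "unk"
    return License(usage, redistribution, no_modification)
```

For instance, parse_license("acad/no") yields usage "acad" and redistribution "no", while parse_license("com/rd (mod-x)") additionally sets no_modification to True.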

Other Attributes

In Provider, we note whether the dataset is provided by universities or institutes (Academia), by companies or their research groups (Industry), or by a combination of these for a competition (Competition).
In Volume, (w) denotes words, (s) denotes sentences, (p) denotes pairs (either document or sentence pairs), (d) denotes dialogues, (h) denotes hours, and (u) denotes speech utterances.
In Goal, Eval is noted if the dataset is intended for evaluation.
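
The Volume notation can be turned into numbers with a short parser. The sketch below is one possible reading, not an official tool: parse_volume, UNITS, SCALE, and TOKEN are hypothetical names, K and M are assumed to mean thousand and million, a trailing "+" is dropped, and counts without a unit letter (such as speaker counts) are ignored. Counts separated by slashes before a single unit marker, as in "150K / 50K (s)", are all assigned that unit.

```python
import re
from typing import List, Tuple

# Unit letters as defined in the Volume description.
UNITS = {"w": "words", "s": "sentences", "p": "pairs",
         "d": "dialogues", "h": "hours", "u": "utterances"}
# Assumption: K means thousand and M means million.
SCALE = {"": 1, "K": 1_000, "M": 1_000_000}

# Matches a count such as "70M", "1,000K", or "12+", or a unit marker such as "(s)".
TOKEN = re.compile(r"(\d[\d,]*)([KM]?)\+?|\(([a-z])\)")


def parse_volume(cell: str) -> List[Tuple[int, str]]:
    """Return (count, unit name) pairs found in a Volume cell."""
    results = []
    pending = []  # counts seen since the last unit marker
    for number, scale, unit in TOKEN.findall(cell):
        if number:
            pending.append(int(number.replace(",", "")) * SCALE[scale])
        elif unit:
            if unit in UNITS:
                results.extend((count, UNITS[unit]) for count in pending)
            pending = []
    # Counts still pending here have no unit letter (e.g. speakers) and are dropped.
    return results
```

With this reading, "150K / 50K (s)" gives two sentence counts, and "80+ (h) / 60K (u)/ 11K speakers" gives 80 hours and 60,000 utterances.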

View at a Glance

The table below describes the open Korean corpora investigated so far. It will be updated as our survey continues and as pull requests arrive. A Korean description, with more information on government-driven databases, is also available.

No.	Dataset	Typical Usage	Provider	Docu.	License	Volume	Goal	Lang.
1	KAIST Morpho-Syntactically Annotated Corpus	Morphological analysis	Academia	art	acad/no	70M (w)	-	ko
2	KAIST Korean Tree-Tagging Corpus	Tree parsing	Academia	inter	acad/no	30K (s)	-	ko
3	UD Korean KAIST	Dependency parsing	Academia	inter	acad/no	27K (s)	-	ko
4	PKT-UD	Dependency parsing	Academia	inter	acad/no	5K (s)	-	ko
5	KMOU NER	NER	Academia	art	acad/rd	24K (s)	-	ko
6	AIR x NAVER NER	NER	Competition	doc	acad/no	90K (s)	-	ko
7	AIR x NAVER SRL	SRL	Competition	doc	acad/no	35K (s)	-	ko
8	Question Pair	Paraphrase detection	Academia	doc	com/rd	10K (p)	-	ko
9	KorNLI	NLI	Industry	inter	com/rd	1,000K (p)	-	ko
10	KorSTS	STS	Industry	inter	com/rd	8,500 (p)	-	ko
11	ParaKQC	STS	Academia	inter	com/rd	540K (p)	-	ko
12	NSMC	Sentiment analysis	Academia	doc	com/rd	150K / 50K (s)	-	ko
13	BEEP!	Hate speech detection	Academia	inter	com/rd	8K / 500 / 1,000 (s)	-	ko
14	3i4K	Speech act classification	Academia	inter	com/rd	55K / 6K (s)	-	ko
15	KorQuAD 1.0	QA	Industry	inter	com/rd (mod-x)	60K / 5K / 4K (p)	-	ko
16	KorQuAD 2.0	QA	Industry	art	com/rd (mod-x)	80K / 10K / 10K (p)	-	ko
17	Sci-news-sum-kr	Summarization	Academia	doc	acad/rd	50 (p)	Eval	ko
18	sae4K	Summarization	Academia	inter	com/rd	50K (p)	-	ko
19	Korean Parallel Corpora	MT	Academia	inter	com/rd (mod-x)	97K (p)	-	ko, en
20	KAIST Translation Evaluation Set	MT	Academia	doc	acad/no	3K (p)	Eval	ko, en
21	KAIST Chinese-Korean Multilingual Corpus	MT	Academia	doc	acad/no	60K (p)	-	ko, zh
22	Transliteration Dataset	Transliteration	Academia	doc	com/rd	35K (p)	-	ko, en
23	KAIST Transliteration Evaluation Set	Transliteration	Academia	doc	acad/no	7K (p)	Eval	ko, en
24	SIGMORPHON G2P	G2P conversion	Competition	inter	com/rd	3,600 / 450 / 450 (p)	-	ko, en, hy, bg, fr, ka, hi, hu, is, lt, el
25	PAWS-X	Paraphrase detection	Industry	inter	com/rd	5K / 2K / 2K (p)	-	ko, fr, es, de, zh, ja
26	TyDi-QA	QA	Industry	inter	com/rd	11K / 1,698 / 1,722 (p)	-	ko, en, ar, bn, fi, ja, id, sw, ru, te, th
27	XPersona	Dialog	Academia	inter	com/rd	299 (d) / 4,684 (s)	-	ko, en, it, fr, id, zh, ja
28	KSS	ASR	Academia	doc	acad/rd	12+ (h) / 13K (u) / 1 speaker	-	ko
29	Zeroth	ASR	Industry	doc	com/rd	51+ (h) / 27K (s) / 46K (u) / 181 speakers	-	ko
30	ClovaCall	ASR	Industry	inter	acad/no	80+ (h) / 60K (u) / 11K speakers	-	ko
31	Pansori-TedXKR	ASR	Academia	inter	acad/rd (mod-x)	3+ (h) / 3K (u) / 41 speakers	-	ko
32	ProSem	SLU	Academia	inter	com/rd	6+ (h) / 3,500 (s) / 7K (u) / 2 speakers	-	ko
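
To shortlist corpora by license or documentation status, the table can be loaded as tab-separated text. The Python sketch below assumes the table is stored with tab separators and the header shown above; SAMPLE holds three rows copied from the table, and load_rows and select are hypothetical helper names rather than anything shipped with this repository.

```python
import csv
import io

# Three rows copied from the table above, tab-separated with the same header.
SAMPLE = (
    "No.\tDataset\tTypical Usage\tProvider\tDocu.\tLicense\tVolume\tGoal\tLang.\n"
    "2\tKAIST Korean Tree-Tagging Corpus\tTree parsing\tAcademia\tinter\tacad/no\t30K (s)\t-\tko\n"
    "9\tKorNLI\tNLI\tIndustry\tinter\tcom/rd\t1,000K (p)\t-\tko\n"
    "17\tSci-news-sum-kr\tSummarization\tAcademia\tdoc\tacad/rd\t50 (p)\tEval\tko\n"
)


def load_rows(text):
    """Parse tab-separated table text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def select(rows, usage=None, docu=None):
    """Keep rows matching the given license usage and/or documentation status."""
    kept = []
    for row in rows:
        # The usage code is the part of the License cell before the first slash.
        row_usage = row["License"].split("/")[0].strip()
        if usage is not None and row_usage != usage:
            continue
        if docu is not None and row["Docu."] != docu:
            continue
        kept.append(row)
    return kept
```

For example, select(load_rows(SAMPLE), usage="com") keeps only the KorNLI row, since it is the only sample row marked as commercially available.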

Citing

To cite our work, please use the following (also available as cho-etal-2020-open in anthology.bib):

@inproceedings{cho-etal-2020-open,
    title = "Open {K}orean Corpora: A Practical Report",
    author = "Cho, Won Ik  and
      Moon, Sangwhan  and
      Song, Youngsook",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.12",
    pages = "85--93",
    abstract = "Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then further iterate through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.",
}

Contributing

Please read the contributor guidelines before sending a pull request.
