Open-korean-corpora
Open Korean NLP Dataset Curation for the Users All Around the Globe
Open Korean Corpora: A Living Document for Korean NLP Dataset Curation
Overview
- Korean, a language with 80M users, is often overlooked in NLP research
- The limited availability of public datasets and tasks has hindered investigation
- Even publicly available datasets often lack English documentation and are hard to discover
- Our work attempts to tackle this problem by curating a living document of open resources for the Korean language
NLP-OSS @ EMNLP 2020
We will be in the live session and monitoring the slide chat during EMNLP 2020. If you have any questions or simply want to say hello, please drop by!
Public Institutions
Multiple government-funded institutions create datasets for the Korean language
- National Institute of Korean Language (NIKL)
- Electronics and Telecommunications Research Institute (ETRI)
- NIA AI HUB
Generally, access to government-funded datasets tends to be heavily restricted for non-Korean citizens.
Thus, Open Corpora here denotes datasets that are freely accessible and downloadable (requiring at most a simple sign-in).
Open Dataset for Korean NLP
Our work focuses on curating open Korean corpora under the following criteria:
- Documentation status
- License for use and distribution
Documentation and License
For documentation status (Docu.), the following labels apply:
- doc - Does the corpus have any documentation on the usage?
- art - Does the corpus have a related article?
- inter - Does the corpus have an internationally available publication?
License
For License, we check the following:
- Use: commercial use allowed (com), academic use only (acad), unknown (unk)
- Redistribution: allowed with modification (rd), allowed without modification only (rd/mod-x), not allowed (no), unknown (unk)
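The license codes above combine a use label and a redistribution label, so they can be split mechanically. The following is a minimal sketch rather than part of this repository: the names `LicenseCode` and `parse_license` are hypothetical, and the `mod_x` field only records whether the mod-x marker is present, without asserting more about its legal meaning.

```python
from dataclasses import dataclass

VALID_USE = {"com", "acad", "unk"}       # commercial, academic only, unknown
VALID_REDIST = {"rd", "no", "unk"}       # redistributable, not redistributable, unknown


@dataclass(frozen=True)
class LicenseCode:
    use: str              # one of VALID_USE
    redistribution: str   # one of VALID_REDIST
    mod_x: bool           # True when the code carries the mod-x marker


def parse_license(cell: str) -> LicenseCode:
    """Parse a License code such as 'com/rd (mod-x)' or 'acad/no'."""
    text = cell.strip().lower()
    mod_x = "mod-x" in text
    # Remove the mod-x marker, then split the remaining 'use/redistribution' pair.
    core = text.replace("(mod-x)", "").replace("mod-x", "")
    parts = [p.strip() for p in core.split("/") if p.strip()]
    if len(parts) != 2 or parts[0] not in VALID_USE or parts[1] not in VALID_REDIST:
        raise ValueError(f"unrecognised license code: {cell!r}")
    return LicenseCode(use=parts[0], redistribution=parts[1], mod_x=mod_x)


if __name__ == "__main__":
    print(parse_license("com/rd (mod-x)"))
    print(parse_license("acad/no"))
```

Codes that fall outside the documented labels raise `ValueError`, which makes typos in a License entry easy to spot.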
Other Attributes
- In Providers, we note whether the dataset is provided by universities or institutes (Academia), by companies or their research groups (Industry), or by a combination of these for a competition (Competition).
- In Volume, (w) denotes words, (s) denotes sentences, (p) denotes pairs (either document or sentence pairs), (d) denotes dialogues, (h) denotes hours, and (u) denotes speech utterances.
- In Goal, Eval is noted if the dataset is intended for evaluation.
View at a Glance
The table below lists the open Korean corpora investigated so far. It will be updated as our survey progresses or through pull requests. You can visit here for the Korean description and more information on government-driven databases.
No. | Dataset | Typical Usage | Provider | Docu. | License | Volume | Goal | Lang. |
---|---|---|---|---|---|---|---|---|
1 | KAIST Morpho-Syntactically Annotated Corpus | Morphological analysis | Academia | art | acad/no | 70M (w) | - | ko |
2 | KAIST Korean Tree-Tagging Corpus | Tree parsing | Academia | inter | acad/no | 30K (s) | - | ko |
3 | UD Korean KAIST | Dependency parsing | Academia | inter | acad/no | 27K (s) | - | ko |
4 | PKT-UD | Dependency parsing | Academia | inter | acad/no | 5K (s) | - | ko |
5 | KMOU NER | NER | Academia | art | acad/rd | 24K (s) | - | ko |
6 | AIR x NAVER NER | NER | Competition | doc | acad/no | 90K (s) | - | ko |
7 | AIR x NAVER SRL | SRL | Competition | doc | acad/no | 35K (s) | - | ko |
8 | Question Pair | Paraphrase detection | Academia | doc | com/rd | 10K (p) | - | ko |
9 | KorNLI | NLI | Industry | inter | com/rd | 1,000K (p) | - | ko |
10 | KorSTS | STS | Industry | inter | com/rd | 8,500 (p) | - | ko |
11 | ParaKQC | STS | Academia | inter | com/rd | 540K (p) | - | ko |
12 | NSMC | Sentiment analysis | Academia | doc | com/rd | 150K / 50K (s) | - | ko |
13 | BEEP! | Hate speech detection | Academia | inter | com/rd | 8K / 500 / 1,000 (s) | - | ko |
14 | 3i4K | Speech act classification | Academia | inter | com/rd | 55K / 6K (s) | - | ko |
15 | KorQuAD 1.0 | QA | Industry | inter | com/rd (mod-x) | 60K / 5K / 4K (p) | - | ko |
16 | KorQuAD 2.0 | QA | Industry | art | com/rd (mod-x) | 80K / 10K / 10K (p) | - | ko |
17 | Sci-news-sum-kr | Summarization | Academia | doc | acad/rd | 50 (p) | Eval | ko |
18 | sae4K | Summarization | Academia | inter | com/rd | 50K (p) | - | ko |
19 | Korean Parallel Corpora | MT | Academia | inter | com/rd (mod-x) | 97K (p) | - | ko, en |
20 | KAIST Translation Evaluation Set | MT | Academia | doc | acad/no | 3K (p) | Eval | ko, en |
21 | KAIST Chinese-Korean Multilingual Corpus | MT | Academia | doc | acad/no | 60K (p) | - | ko, zh |
22 | Transliteration Dataset | Transliteration | Academia | doc | com/rd | 35K (p) | - | ko, en |
23 | KAIST Transliteration Evaluation Set | Transliteration | Academia | doc | acad/no | 7K (p) | Eval | ko, en |
24 | SIGMORPHON G2P | G2P conversion | Competition | inter | com/rd | 3,600 / 450 / 450 (p) | - | ko, en, hy, bg, fr, ka, hi, hu, is, lt, el |
25 | PAWS-X | Paraphrase detection | Industry | inter | com/rd | 5K / 2K / 2K (p) | - | ko, fr, es, de, zh, ja |
26 | TyDi-QA | QA | Industry | inter | com/rd | 11K / 1,698 / 1,722 (p) | - | ko, en, ar, bn, fi, ja, id, sw, ru, te, th |
27 | XPersona | Dialog | Academia | inter | com/rd | 299 (d) / 4,684 (s) | - | ko, en, it, fr, id, zh, ja |
28 | KSS | ASR | Academia | doc | acad/rd | 12+ (h) / 13K (u) / 1 speaker | - | ko |
29 | Zeroth | ASR | Industry | doc | com/rd | 51+ (h) / 27K (s) / 46K (u) / 181 speakers | - | ko |
30 | ClovaCall | ASR | Industry | inter | acad/no | 80+ (h) / 60K (u) / 11K speakers | - | ko |
31 | Pansori-TedXKR | ASR | Academia | inter | acad/rd (mod-x) | 3+ (h) / 3K (u) / 41 speakers | - | ko |
32 | ProSem | SLU | Academia | inter | com/rd | 6+ (h) / 3,500 (s) / 7K (u) / 2 speakers | - | ko |
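Because the table is plain pipe-delimited text, it can be filtered with a few lines of code, for example to list only datasets whose license permits commercial use. This is a rough sketch with hypothetical helper names (`read_pipe_table`, `commercially_usable`); the embedded excerpt copies four rows from the table above so that the example is self-contained.

```python
TABLE_EXCERPT = """\
No. | Dataset | Typical Usage | Provider | Docu. | License | Volume | Goal | Lang. |
---|---|---|---|---|---|---|---|---|
5 | KMOU NER | NER | Academia | art | acad/rd | 24K (s) | - | ko |
9 | KorNLI | NLI | Industry | inter | com/rd | 1,000K (p) | - | ko |
12 | NSMC | Sentiment analysis | Academia | doc | com/rd | 150K / 50K (s) | - | ko |
20 | KAIST Translation Evaluation Set | MT | Academia | doc | acad/no | 3K (p) | Eval | ko, en |
"""


def read_pipe_table(text: str) -> list[dict[str, str]]:
    """Read a pipe-delimited table into a list of row dictionaries keyed by header."""
    lines = [ln.strip().strip("|") for ln in text.splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[2:]:  # skip the header and the ---|--- separator
        cells = [cell.strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def commercially_usable(rows: list[dict[str, str]]) -> list[str]:
    """Return the names of datasets whose License code starts with 'com'."""
    return [row["Dataset"] for row in rows if row["License"].startswith("com/")]


if __name__ == "__main__":
    print(commercially_usable(read_pipe_table(TABLE_EXCERPT)))
```

The same row dictionaries can be filtered on any other column, such as `Goal == "Eval"` to find evaluation-only sets.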
Citing
To cite our work, please use the following (also available as cho-etal-2020-open in anthology.bib):
@inproceedings{cho-etal-2020-open,
title = "Open {K}orean Corpora: A Practical Report",
author = "Cho, Won Ik and
Moon, Sangwhan and
Song, Youngsook",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.12",
pages = "85--93",
abstract = "Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then further iterate through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.",
}
Contributing
Please read the contributor guidelines before sending a pull request.