awesome-japanese-nlp-resources

A curated list of resources for Japanese NLP: Python libraries, pre-trained models, dictionaries, and corpora.

Your contributions are always welcome! Please read the Contribution guidelines before contributing.

Python library
- Morphology analysis
- Parsing
- Converter
- Preprocessor
- Sentence splitter
- Sentiment analysis
- Machine translation
- Named entity recognition
- OCR
- Tool for pretrained models
- Others
Rust crate
- Morphology analysis
- Converter
- Search engine library
Pretrained model
- Word2Vec
- Transformer based models
Dictionary
Corpus
Tutorial
Research summary
Reference
Contributors

Python library

Morphology analysis

sudachi.rs - SudachiPy 0.6* and above are developed as Sudachi.rs.
Janome - Japanese morphological analysis engine written in pure Python
mecab-python3 - mecab-python. The original version is available at http://taku910.github.io/mecab/
mecab - This repository is for building Windows 64-bit MeCab binary and improving MeCab Python binding.
fugashi - A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
nagisa - A Japanese tokenizer based on recurrent neural networks
pyknp - A Python Module for JUMAN++/KNP
Mykytea-python - Python wrapper for KyTea
konoha - Konoha: Simple wrapper of Japanese Tokenizers
natto-py - natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
rakutenma-python - Rakuten MA (Python version)
python-vaporetto - Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
dango - An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
rhoknp - Yet another Python binding for Juman++/KNP

Name	downloads/week	total downloads	stars
SudachiPy
Janome
mecab-python3
mecab
fugashi
nagisa
pyknp
Mykytea-python
konoha
natto-py
rakutenma-python
python-vaporetto
dango
rhoknp

Parsing

ginza - A Japanese NLP Library using spaCy as framework based on Universal Dependencies
cabocha - Yet Another Japanese Dependency Structure Analyzer
UniDic2UD - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese
camphr - Camphr: NLP library for creating pipeline components
SuPar-UniDic - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models
depccg - A* CCG Parser with a Supertag and Dependency Factored Model
bertknp - A Japanese dependency parser based on BERT
esupar - Tokenizer POS-Tagger and Dependency-parser with BERT/RoBERTa/DeBERTa models for Japanese and other languages

Name	downloads/week	total downloads
ginza
cabocha
UniDic2UD
camphr
SuPar-UniDic
depccg
bertknp	-	-
esupar

Converter

pykakasi - Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
cutlet - Japanese to romaji converter in Python
alphabet2kana - Convert English alphabet to Katakana
Convert-Numbers-to-Japanese - Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
mozcpy - Mozc for Python: Kana-Kanji converter

Name	downloads/week	total downloads
pykakasi
cutlet
alphabet2kana
Convert-Numbers-to-Japanese	-	-
mozcpy
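
As a rough illustration of the kind of conversion that a tool such as Convert-Numbers-to-Japanese targets, here is a minimal standard-library sketch that writes an integer in kanji numerals. The function name `to_kanji` and the supported range of 0 to 9999 are assumptions made for this example; it does not use or reproduce the API of any library listed above.

```python
DIGITS = "〇一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def to_kanji(n: int) -> str:
    """Write an integer in the range 0..9999 using kanji numerals (hypothetical helper)."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only supports 0..9999")
    if n == 0:
        return DIGITS[0]
    parts = []
    for power in (3, 2, 1, 0):
        digit = (n // 10 ** power) % 10
        if digit == 0:
            continue
        # A leading 一 is conventionally omitted before 十, 百, and 千.
        if digit == 1 and power > 0:
            parts.append(UNITS[power])
        else:
            parts.append(DIGITS[digit] + UNITS[power])
    return "".join(parts)


print(to_kanji(2023))  # 二千二十三
```

Larger units (万, 億) and formal numerals (壱, 弐) are left out to keep the sketch short.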

Preprocessor

neologdn - Japanese text normalizer for mecab-neologd
jaconv - Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
mojimoji - A fast converter between Japanese hankaku and zenkaku characters
text-cleaning - A powerful text cleaner for Japanese web texts

Name	downloads/week	total downloads
neologdn
jaconv
mojimoji
text-cleaning	-	-
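
For readers new to this kind of preprocessing, the following is a minimal standard-library sketch of two normalizations that the tools above deal with: half-width/full-width (hankaku/zenkaku) folding and hiragana/katakana conversion. The function names are assumptions for this example, and the code is not the API of neologdn, jaconv, or mojimoji. It relies on Unicode NFKC normalization and on the fixed 0x60 offset between the hiragana and katakana blocks.

```python
import unicodedata

# Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 are laid out in parallel,
# 0x60 code points apart, so a translation table is enough.
_HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}
_KATA_TO_HIRA = {cp + 0x60: cp for cp in range(0x3041, 0x3097)}


def normalize_width(text: str) -> str:
    """Fold half-width kana to full-width and full-width ASCII to half-width via NFKC."""
    return unicodedata.normalize("NFKC", text)


def hira_to_kata(text: str) -> str:
    """Convert hiragana to katakana, leaving other characters untouched."""
    return text.translate(_HIRA_TO_KATA)


def kata_to_hira(text: str) -> str:
    """Convert katakana to hiragana, leaving other characters untouched."""
    return text.translate(_KATA_TO_HIRA)


print(normalize_width("ｶﾞｲﾄﾞ ＡＢＣ１２３"))  # ガイド ABC123
print(hira_to_kata("ひらがな"))  # ヒラガナ
print(kata_to_hira("カタカナ"))  # かたかな
```

NFKC also folds other compatibility characters (for example, circled digits), which may or may not be wanted; the dedicated libraries give finer control over which character classes are converted.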

Sentence splitter

Bunkai - Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
japanese-sentence-breaker - Japanese Sentence Breaker
sengiri - Yet another sentence-level tokenizer for the Japanese text
budoux - Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool.
ja_sentence_segmenter - Japanese sentence segmentation library for Python
hasami - A tool to perform sentence segmentation on Japanese text
kuzukiri - Japanese Text Segmenter for Python written in Rust

Name	downloads/week	total downloads	stars
bunkai
japanese-sentence-breaker
sengiri
budoux
ja_sentence_segmenter
hasami
kuzukiri
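
To show the baseline that these tools improve on, here is a naive rule-based splitter using only the standard library. It is a sketch under simple assumptions (sentences end at 。, ！, ？, their ASCII counterparts, or a newline) and is not the algorithm of any library listed above; `split_sentences` is a name chosen for this example.

```python
import re

# A run of non-terminator characters followed by any run of terminators.
# Newlines are treated as hard boundaries.
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def split_sentences(text: str) -> list:
    """Split text after sentence-final punctuation, dropping empty pieces."""
    pieces = (match.group().strip() for match in _SENTENCE.finditer(text))
    return [piece for piece in pieces if piece]


for sentence in split_sentences("今日は晴れです。明日は雨ですか？はい！"):
    print(sentence)
```

A rule this simple also splits inside quotations such as 「はい。」と言った。, and it makes no attempt at the boundary disambiguation that the dedicated tools provide.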

Sentiment analysis

oseti - Dictionary based Sentiment Analysis for Japanese
negapoji - Japanese negative/positive classification: determines whether a Japanese document is negative or positive.
pymlask - Emotion analyzer for Japanese text
asari - Japanese sentiment analyzer implemented in Python.

Name	downloads/week	total downloads
oseti
negapoji	-	-
pymlask
asari

Machine translation

jparacrawl-finetune - An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
JASS - JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020) & Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (ACM TALLIP)
PheMT - A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena; Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
VISA - An ambiguous subtitles dataset for visual scene-aware machine translation

Name	downloads/week	total downloads
jparacrawl-finetune	-	-
JASS	-	-
PheMT	-	-
VISA	-	-

Named entity recognition

namaco - Character Based Named Entity Recognition.
entitypedia - Entitypedia is an Extended Named Entity Dictionary from Wikipedia.
noyaki - Converts character span label information to tokenized text-based label information.
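bert-japanese-ner-finetuning - Code to perform finetuning of the BERT model. A sample that fine-tunes a BERT model to create and use a model for the named entity recognition task.
joint-information-extraction-hs - Code for inference of the extraction accuracy of named entities and relations from a case report corpus based on detailed annotation criteria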

Name	downloads/week	total downloads
namaco	-	-
entitypedia	-	-
noyaki
bert-japanese-ner-finetuning	-	-
joint-information-extraction-hs	-	-

OCR

Manga OCR - Optical character recognition for Japanese text, with the main focus being Japanese manga
mokuro - Read Japanese manga inside browser with selectable text.
handwritten-japanese-ocr - Handwritten Japanese OCR demo using touch panel to draw the input text using Intel OpenVINO toolkit
OCR_Japanease - Japanese OCR
ndlocr_cli - NDLOCR application
donut - Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
JMTrans - Manga translator: fetches Japanese manga from a URL and translates the manga images
Kindai-OCR - OCR system for recognizing modern Japanese magazines

Name	downloads/week	total downloads
manga-ocr
mokuro
handwritten-japanese-ocr	-	-
OCR_Japanease	-	-
ndlocr_cli	-	-
donut
JMTrans	-	-
Kindai-OCR	-	-

Tool for pretrained models

JGLUE - JGLUE: Japanese General Language Understanding Evaluation
ginza-transformers - Use custom tokenizers in spacy-transformers
t5_japanese_dialogue_generation - Dialogue generation with T5
japanese_text_classification - To investigate various DNN text classifiers including MLP, CNN, RNN, BERT approaches.
Japanese-BERT-Sentiment-Analyzer - Deploying sentiment analysis server with FastAPI and BERT
jmlm_scoring - Masked Language Model-based Scoring for Japanese and Vietnamese
allennlp-shiba-model - AllenNLP integration for Shiba: Japanese CANINE model
evaluate_japanese_w2v - script to evaluate pre-trained Japanese word2vec model on Japanese similarity dataset
gector-ja - BERT-based GEC tagging for Japanese
Japanese-BPEEncoder - Japanese-BPEEncoder
Japanese-BPEEncoder_V2 - Japanese-BPEEncoder Version 2
transformer-copy - Japanese grammatical error correction tool

Name	downloads/week	total downloads
JGLUE	-	-
ginza-transformers
t5_japanese_dialogue_generation	-	-
japanese_text_classification	-	-
Japanese-BERT-Sentiment-Analyzer	-	-
jmlm_scoring	-	-
allennlp-shiba-model
evaluate_japanese_w2v	-	-
gector-ja	-	-
Japanese-BPEEncoder	-	-
Japanese-BPEEncoder_V2	-	-
transformer-copy	-	-

Others

namedivider-python - A tool for dividing the Japanese full name into a family name and a given name.
asa-python - A curated list of resources dedicated to Python libraries of NLP for Japanese
python_asa - Python version of the Japanese semantic role labeling system (ASA)
toiro - A comparison tool of Japanese tokenizers
ja-timex - A rule-based analyzer that extracts and normalizes temporal expressions written in natural language
JapaneseTokenizers - A set of metrics for feature selection from text data
daaja - This repository has implementations of data augmentation for NLP for Japanese.
accel-brain-code - The purpose of this repository is to make prototypes as case study in the context of proof of concept(PoC) and research and development(R&D) that I have written in my website. The main research topics are Auto-Encoders in relation to the representation learning, the statistical machine learning for energy-based models, adversarial generation net…
kyoto-reader - A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
nlplot - Visualization Module for Natural Language Processing
rake-ja - Rapid Automatic Keyword Extraction algorithm for Japanese
jel - Japanese Entity Linker.
MedNER-J - Latest version of MedEX/J (Japanese disease name extractor)
zunda-python - Zunda: Japanese Enhanced Modality Analyzer client for Python.
AIO2_DPR_baseline - https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
showcase - A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
darts-clone-python - Darts-clone python binding
jrte-corpus_example - Example codes for Japanese Realistic Textual Entailment Corpus
desuwa - Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
HotPepperGourmetDialogue - Restaurant Search System through Dialogue in Japanese.
nlp-recipes-ja - Sample code for natural language processing in Japanese
Japanese_nlp_scripts - Small example scripts for working with Japanese texts in Python
DNorm-J - Japanese version of DNorm
pyknp-eventgraph - EventGraph is a development platform for high-level NLP applications in Japanese.
ishi - Ishi: A volition classifier for Japanese
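python-npylm - Unsupervised morphological analysis with a Bayesian hierarchical language model
python-npycrf - Semi-supervised morphological analysis that integrates conditional random fields with a Bayesian hierarchical language model
unsupervised-pos-tagging - Unsupervised part-of-speech tag estimation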
negima - Negima is a Python package to extract phrases in Japanese text by using the part-of-speeches based rules you defined.
YouyakuMan - Extractive summarizer using BertSum as summarization model
japanese-numbers-python - A parser for Japanese numbers (kanji, Arabic) in natural language.
kantan - Look up Japanese words by radical patterns
make-meidai-dialogue - Get Japanese dialogue corpus
japanese_summarizer - A summarizer for Japanese articles.
chirptext - ChirpText is a collection of text processing tools for Python.
yubin - Japanese Address Munger
jawiki-cleaner - Japanese Wikipedia Cleaner
japanese2phoneme - A Python library to convert Japanese to phonemes.
anlp_nlp2021_d3-1 - This repository contains codes related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification"
aozora_classification - This project aims to classify Japanese sentences by how similar they are to the writing of classical Japanese authors such as Soseki Natsume, Ogai Mori, and Ryunosuke Akutagawa.
aozora-corpus-generator - Generates plain or tokenized text files from the Aozora Bunko
JLM - A fast LSTM Language Model for large vocabulary language like Japanese and Chinese
NTM - Testing of Neural Topic Modeling for Japanese articles
EN-JP-ML-Lexicon - An English-Japanese lexicon for Machine Learning and Deep Learning terminology.
text-generation - Easy-to-use scripts to fine-tune GPT-2-JA with your own texts, to generate sentences, and to tweet them automatically.
chainer_nic - Neural Image Caption (NIC) on chainer, its pretrained models on English and Japanese image caption datasets.
unihan-lm - The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
mbart-finetuning - Code to perform finetuning of the mBART model.
xvector_jtubespeech - xvector model on jtubespeech
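TinySegmenterMaker - A tool for building your own trained models for TinySegmenter.
Grongish - A script for converting between Japanese and the Grongi language
WordCloud-Japanese - A script that gives WordCloud a morphological-analysis-like display of Japanese text without using MeCab (a morphological analysis engine)
snark - A DB access library that uses Japanese WordNet
toEmoji - Something that converts Japanese sentences into emoji-only sentences
kokkosho_data - Practice implementation of a technical term extraction algorithm
JDT-with-KenLM-scoring - Scores the response candidates of Japanese-Dialog-Transformer with a KenLM N-gram language model, then filters or reranks them.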
mixture-of-unigram-model - Mixture of Unigram Model and Infinite Mixture of Unigram Model in Python. (混合ユニグラムモデルと無限混合ユニグラムモデル)
hidden-markov-model - Hidden Markov Model (HMM) and Infinite Hidden Markov Model (iHMM) in Python. (隠れマルコフモデルと無限隠れマルコフモデル)
Ngram-language-model - Ngram language model in Python. (Nグラム言語モデル)
ASRDeepSpeech - Automatic Speech Recognition with deepspeech2 model in pytorch with support from Zakuro AI.
neural_ime - Neural IME: Neural Input Method Engine
neural_japanese_transliterator - Can neural networks transliterate Romaji into Japanese correctly?

Name	downloads/week	total downloads
namedivider-python
asa-python
python_asa	-	-
toiro
ja-timex
JapaneseTokenizers	-	-
daaja
accel-brain-code
JGLUE	-	-
kyoto-reader
nlplot
rake-ja	-	-
jel
MedNER-J	-	-
zunda-python
AIO2_DPR_baseline	-	-
showcase
darts-clone-python
jrte-corpus_example	-	-
desuwa
HotPepperGourmetDialogue	-	-
nlp-recipes-ja	-	-
Japanese_nlp_scripts	-	-
DNorm-J	-	-
pyknp-eventgraph
ishi
python-npylm	-	-
python-npycrf	-	-
unsupervised-pos-tagging	-	-
negima
YouyakuMan	-	-
japanese-numbers-python
kantan	-	-
make-meidai-dialogue	-	-
japanese_summarizer	-	-
chirptext
yubin
jawiki-cleaner
japanese2phoneme
anlp_nlp2021_d3-1	-	-
aozora_classification	-	-
aozora-corpus-generator	-	-
JLM	-	-
NTM	-	-
EN-JP-ML-Lexicon	-	-
text-generation	-	-
chainer_nic	-	-
unihan-lm	-	-
mbart-finetuning	-	-
xvector_jtubespeech	-	-
TinySegmenterMaker	-	-
Grongish	-	-
WordCloud-Japanese	-	-
snark	-	-
toEmoji	-	-
kokkosho_data	-	-
JDT-with-KenLM-scoring	-	-
mixture-of-unigram-model	-	-
hidden-markov-model	-	-
Ngram-language-model	-	-
ASRDeepSpeech	-	-
neural_ime	-	-
neural_japanese_transliterator	-	-

Rust crate

Morphology analysis

lindera - A morphological analysis library.
vaporetto - Vaporetto: Very Accelerated POintwise pREdicTion based TOkenizer
goya - Japanese Morphological Analysis written in Rust
vibrato - vibrato: Viterbi-based accelerated tokenizer
yoin - A Japanese Morphological Analyzer written in pure Rust

Name	downloads/week	total downloads
lindera	-	-
vaporetto	-	-
goya	-	-
vibrato	-	-
yoin	-	-

Converter

wana_kana_rust - Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
unicode-jp-rs - A Rust library to convert Japanese Half-width-kana[半角ｶﾅ] and Wide-alphanumeric[全角英数] into normal ones
kana - [Mirror] CLI program for transliterating romaji text to either hiragana or katakana

Name	downloads/week	total downloads
wana_kana_rust	-	-
unicode-jp-rs	-	-
kana	-	-

Search engine library

lindera-tantivy - Lindera tokenizer for Tantivy.

Name	downloads/week	total downloads	stars
lindera-tantivy	-	-

Pretrained model

Word2Vec

japanese-words-to-vectors - Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.
chiVe - Japanese word embedding with Sudachi and NWJC
elmo-japanese - elmo-japanese
embedrank - Python Implementation of EmbedRank
japanese-words-to-vectors - Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.
aovec - Easy Aozora Bunko Word2Vec builder: builds Word2Vec from all Aozora Bunko books, and provides prebuilt models
dependency-based-japanese-word-embeddings - This is a repository for the AI LAB article "係り受けに基づく日本語単語埋込 (Dependency-based Japanese Word Embeddings)" ( Article URL https://ai-lab.lapras.com/nlp/japanese-word-embedding/)
jawikivec - Yet Another Japanese-Wikipedia Entity Vectors
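jawiki_word_vector_updater - Scripts that run morphological analysis on the latest Japanese Wikipedia dump with MeCab, using both the IPA dictionary and the latest NEologd dictionary, and train word2vec, fastText, and GloVe word embeddings on the results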

Name	downloads/week	total downloads
japanese-words-to-vectors	-	-
chiVe	-	-
elmo-japanese	-	-
embedrank	-	-
japanese-words-to-vectors	-	-
aovec
dependency-based-japanese-word-embeddings	-	-
jawikivec	-	-
jawiki_word_vector_updater	-	-

Transformer based models

bert-japanese - BERT models for Japanese text.
japanese-pretrained-models - Code for producing Japanese pretrained models provided by rinna Co., Ltd.
bert-japanese - BERT with SentencePiece for Japanese text.
SudachiTra - Japanese tokenizer for Transformers
japanese-dialog-transformers - Code for evaluating Japanese pretrained models provided by NTT Ltd.
shiba - Pytorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
Dialog - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
language-pretraining - BERT and ELECTRA models of PyTorch implementations for Japanese text.
medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
ILYS-aoba-chatbot - ILYS-aoba-chatbot
t5-japanese - Codes to pre-train Japanese T5 models
pytorch_bert_japanese - Use a pre-trained Japanese BERT model with PyTorch
Laboro-BERT-Japanese - Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
RoBERTa-japanese - Japanese BERT Pretrained Model
aMLP-japanese - aMLP Transformer Model for Japanese
bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
sbert-ja - Code to train Sentence BERT Japanese model for Hugging Face Model Hub
BERT-Japan-vaccination - Official fine-tuning code for "Emotion Analysis of Japanese Tweets and Comparison to Vaccinations in Japan"
gpt2-japanese - Japanese GPT2 Generation Model
text2text-japanese - gpt-2 based text2text conversion model
gpt-ja - GPT-2 Japanese model for HuggingFace's transformers
friendly_JA-Model - MT model trained using the friendly_JA Corpus attempting to make Japanese easier/more accessible to occidental people by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
albert-japanese - BERT with SentencePiece for Japanese text.
ja_text_bert - Repository for generating a pre-trained BERT model from the Japanese Wikipedia corpus
bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
DistilBERT-base-jp - A Japanese DistilBERT pretrained model, which was trained on Wikipedia.
bert - This repository provides snippets to use RoBERTa pre-trained on Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25GB in total. The released model is built based on that from HuggingFace.
medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
Laboro-DistilBERT-Japanese - Laboro DistilBERT Japanese

Name	downloads/week	total downloads
bert-japanese	-	-
japanese-pretrained-models	-	-
bert-japanese	-	-
SudachiTra
japanese-dialog-transformers	-	-
shiba
Dialog	-	-
language-pretraining	-	-
medbertjp	-	-
ILYS-aoba-chatbot	-	-
t5-japanese	-	-
pytorch_bert_japanese	-	-
Laboro-BERT-Japanese	-	-
RoBERTa-japanese	-	-
aMLP-japanese	-	-
bert-japanese-aozora	-	-
sbert-ja	-	-
BERT-Japan-vaccination	-	-
gpt2-japanese	-	-
text2text-japanese	-	-
gpt-ja	-	-
friendly_JA-Model	-	-
albert-japanese	-	-
ja_text_bert	-	-
bert-japanese-aozora	-	-
DistilBERT-base-jp	-	-
bert	-	-
medbertjp	-	-
Laboro-DistilBERT-Japanese	-	-

Dictionary

mecab-ipadic-neologd - Neologism dictionary based on the language resources on the Web for mecab-ipadic
tdmelodic - A Japanese accent dictionary generator
jamdict - Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings
unidic-py - Unidic packaged for installation via pip.
Japanese-Company-Lexicon - Japanese Company Lexicon (JCLdic)
manbyo-sudachi - Manbyo dictionary (万病辞書) for Sudachi
jawiki-kana-kanji-dict - Generate SKK/MeCab dictionary from Wikipedia(Japanese edition)
JIWC-Dictionary - dictionary to find emotion related to text
JumanDIC - This repository contains source dictionary files to build dictionaries for JUMAN and Juman++.
ipadic-py - IPAdic packaged for easy use from Python.
unidic-lite - A small version of UniDic for easy pip installs.
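emoji-ime-dictionary - Add-on IME dictionary for typing emoji in Japanese. An IME extension dictionary that enables Japanese-to-emoji conversion in Google Japanese Input and similar IMEs
google-ime-dictionary - Add-on IME dictionary for Japanese-to-English conversion and English abbreviation expansion. An IME extension dictionary that enables Japanese-to-English conversion and expansion of English abbreviations in Google Japanese Input, ATOK, and similar IMEs
dic-nico-intersection-pixiv - IME dictionary of the entries common to Nico Nico Pedia and Pixiv Encyclopedia
google-ime-user-dictionary-ja-en - Project archive of Google IME user dictionary from Katakana word (Japanese loanword) to English.
emoticon - Emoticon dictionary for Google Japanese Input ∩(,,Ò‿Ó,,)∩
mecab-mozcdic - The open source mozc dictionary converted to the MeCab dictionary format.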

Name	downloads/week	total downloads
mecab-ipadic-neologd	-	-
tdmelodic	-	-
jamdict
unidic-py
Japanese-Company-Lexicon	-	-
manbyo-sudachi	-	-
jawiki-kana-kanji-dict	-	-
JIWC-Dictionary	-	-
JumanDIC	-	-
ipadic-py
unidic-lite
emoji-ime-dictionary	-	-
google-ime-dictionary	-	-
dic-nico-intersection-pixiv	-	-
google-ime-user-dictionary-ja-en	-	-
emoticon	-	-
mecab-mozcdic	-	-

Corpus

jrte-corpus - Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
open2ch-dialogue-corpus - Dialogue corpus created by crawling Open 2channel (おーぷん2ちゃんねる)
kanji-data - A JSON kanji dataset with updated JLPT levels and WaniKani information
JapaneseWordSimilarityDataset - Japanese Word Similarity Dataset
simple-jppdb - A paraphrase database for Japanese text simplification
TwitterCorpus - Tokyo Metropolitan University Japanese Twitter corpus
chABSA-dataset - chakki's Aspect-Based Sentiment Analysis dataset
ner-wikipedia-dataset - Japanese named entity recognition dataset built from Wikipedia
JaQuAD - JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)
JaNLI - Japanese Adversarial Natural Language Inference Dataset
BSD - The Business Scene Dialogue corpus
dataset-list - lists of text corpus and more (mainly Japanese)
UD_Japanese-PUD - Parallel Universal Dependencies.
ebe-dataset - Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
UD_Japanese-GSD - Japanese data from the Google UDT 2.0.
emoji-ja - Dictionary of Japanese readings, keywords, and categories for Unicode emoji
nayose-wikipedia-ja - Japanese name-matching (名寄せ) dataset created from Wikipedia
IOB2Corpus - Japanese IOB2 tagged corpus for Named Entity Recognition.
ja.text8 - Japanese text8 corpus for word embedding.
ThreeLineSummaryDataset - Three-line summary dataset
japanese - This repo contains a list of the 44,998 most common Japanese words in order of frequency, as determined by the University of Leeds Corpus.
kanji-frequency - Kanji usage frequency data collected from various sources
TEDxJP-10K - TEDxJP-10K ASR Evaluation Dataset
CoARiJ - Corpus of Annual Reports in Japan
small_parallel_enja - 50k English-Japanese Parallel Corpus for Machine Translation Benchmark.
KWDLC - Kyoto University Web Document Leads Corpus
AnnotatedFKCCorpus - Annotated Fuman Kaitori Center Corpus
technological-book-corpus-ja - Raw corpus and tools built from technical books written in Japanese
ita-corpus-chuwa - Chunked word annotation for ITA corpus
asdc - Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
wikipedia-utils - Utility scripts for preprocessing Wikipedia texts for NLP
Web-Crawled-Corpus-for-Japanese-Chinese-NMT - A Web Crawled Corpus for Japanese-Chinese NMT
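inappropriate-words-ja - A collection of inappropriate expressions in Japanese, intended for uses such as data cleaning in natural language processing.
house-of-councillors - Data on parliamentary groups, members, bills, and written questions, organized from the official website of the House of Councillors.
house-of-representatives - Diet bill database: House of Representatives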
STAIR-captions - STAIR captions: large-scale Japanese image caption dataset
Winograd-Schema-Challenge-Ja - Japanese Translation of Winograd Schema Challenge
speechBSD - An extension of the BSD corpus with audio and speaker attribute information
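ita-corpus - List of the sentences in the ITA corpus
rohan4600 - Mora-balanced Japanese corpus
anlp-jp-history - Complete list of presentations at the annual meetings of the Association for Natural Language Processing, including machine-readable versions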
JMRD - Japanese Movie Recommendation Dialogue dataset
keigo_transfer_task - Evaluation dataset for the honorific (keigo) conversion task
CourseraParallelCorpusMining - Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
JESC - A large parallel corpus of English and Japanese
loanwords_gairaigo - English loanwords in Japanese
jawikicorpus - Japanese-Wikipedia Wikification Corpus
GeneralPolicySpeechOfPrimeMinisterOfJapan - A corpus of Japanese text from the general policy speeches of the Prime Minister of Japan
AMI-Meeting-Parallel-Corpus - AMI Meeting Parallel Corpus
giant_ja-en_parallel_corpus - This directory includes a giant Japanese-English subtitle corpus. The raw data comes from the Stanford’s JESC project.
japanese-corpus - Japanese dialogue data for seq2seq etc.
jesc_small - Small Japanese-English Subtitle Corpus
wrime - WRIME: a dataset for subjective and objective emotion analysis
jtubespeech - JTubeSpeech: Corpus of Japanese speech collected from YouTube
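WikipediaWordFrequencyList - List of frequent words used in Japanese Wikipedia
kokkosho_data - Dataset on vehicle defect information
pdmocrdataset-part1 - OCR training dataset created in the project to convert digitized materials to OCR text
huriganacorpus-ndlbib - Furigana dataset created from the Japanese national bibliography data
jvs_hiho - Self-made labels for the JVS (Japanese versatile speech) corpus
graded-enja-corpus - A Japanese-English parallel corpus that takes prohibited terms and word levels into account.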
cjk-compsci-terms - CJK computer science terms comparison / 中日韓電腦科學術語對照 / 日中韓のコンピュータ科学の用語対照 / 한·중·일 전산학 용어 대조
Laboro-ParaCorpus - Scripts for creating a Japanese-English parallel corpus and training NMT models

Name	downloads/week	total downloads
jrte-corpus	-	-
open2ch-dialogue-corpus	-	-
kanji-data	-	-
JapaneseWordSimilarityDataset	-	-
simple-jppdb	-	-
TwitterCorpus	-	-
chABSA-dataset	-	-
ner-wikipedia-dataset	-	-
JaQuAD	-	-
JaNLI	-	-
BSD	-	-
dataset-list	-	-
UD_Japanese-PUD	-	-
ebe-dataset	-	-
UD_Japanese-GSD	-	-
emoji-ja	-	-
nayose-wikipedia-ja	-	-
IOB2Corpus	-	-
ja.text8	-	-
ThreeLineSummaryDataset	-	-
japanese	-	-
kanji-frequency	-	-
TEDxJP-10K	-	-
CoARiJ	-	-
small_parallel_enja	-	-
KWDLC	-	-
AnnotatedFKCCorpus	-	-
technological-book-corpus-ja	-	-
ita-corpus-chuwa	-	-
asdc	-	-
wikipedia-utils	-	-
Web-Crawled-Corpus-for-Japanese-Chinese-NMT	-	-
inappropriate-words-ja	-	-
house-of-councillors	-	-
house-of-representatives	-	-
STAIR-captions	-	-
Winograd-Schema-Challenge-Ja	-	-
speechBSD	-	-
ita-corpus	-	-
rohan4600	-	-
anlp-jp-history	-	-
JMRD	-	-
keigo_transfer_task	-	-
CourseraParallelCorpusMining	-	-
JESC	-	-
loanwords_gairaigo	-	-
jawikicorpus	-	-
GeneralPolicySpeechOfPrimeMinisterOfJapan	-	-
AMI-Meeting-Parallel-Corpus	-	-
giant_ja-en_parallel_corpus	-	-
japanese-corpus	-	-
jesc_small	-	-
wrime	-	-
jtubespeech	-	-
WikipediaWordFrequencyList	-	-
kokkosho_data	-	-
pdmocrdataset-part1	-	-
huriganacorpus-ndlbib	-	-
jvs_hiho	-	-
graded-enja-corpus	-	-
cjk-compsci-terms	-	-
Laboro-ParaCorpus	-	-

Tutorial

spacy_tutorial - spaCy tutorial in English and Japanese. spacy-transformers, BERT, GiNZA.
fastTextJapaneseTutorial - Tutorial to train fastText with Japanese corpus
allennlp-NER-ja - AllenNLP-NER-ja: Named entity recognition for Japanese with AllenNLP
chariot-PyTorch-Japanese-text-classification - Experiment for Japanese Text classification using chariot and PyTorch
ginza-examples - An encouragement to use GiNZA, the Japanese NLP library
DocumentClassificationUsingBERT-Japanese - DocumentClassificationUsingBERT-Japanese
BERT_Japanese_Google_Colaboratory - How to run Japanese BERT on Google Colaboratory.
bert-book - Support page for the book 「BERTによる自然言語処理入門: Transformersを使った実践プログラミング」 (Introduction to Natural Language Processing with BERT: Practical Programming with Transformers)
janome-tutorial - An introductory text mining tutorial using Janome.
handson-language-models - Hands-on materials for Japanese language models

Name	downloads/week	total downloads
spacy_tutorial	-	-
fastTextJapaneseTutorial	-	-
allennlp-NER-ja	-	-
chariot-PyTorch-Japanese-text-classification	-	-
ginza-examples	-	-
DocumentClassificationUsingBERT-Japanese	-	-
BERT_Japanese_Google_Colaboratory	-	-
bert-book	-	-
janome-tutorial	-	-
handson-language-models	-	-

Research summary

awesome-bert-japanese - A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
GEC-Info-ja - Repository for collecting and classifying Japanese-language literature on grammatical error correction

Name	downloads/week	total downloads	stars
awesome-bert-japanese	-	-
GEC-Info-ja	-	-

Reference

自然言語処理の餅屋
フリーで使える日本語の主な大規模言語モデルまとめ
yasuokaの日記：日本語係り受け解析器「2020年の総ざらえ」
yasuokaの日記：日本語係り受け解析器「2021年の総ざらえ」
https://github.com/topics/japanese?l=python
https://github.com/topics/japanese-language?l=python
https://github.com/search?o=desc&q=corpus+japanese&s=&type=Repositories
https://paperswithcode.com/datasets?lang=japanese
https://github.com/himkt/awesome-bert-japanese
Awesome-Rust-MachineLearning - A collection of Rust crates, articles, and other resources for Japanese

Contributors

kaisugi - website
