A curated list of resources dedicated to Python libraries, LLMs, dictionaries, and corpora of NLP for Japanese

awesome-japanese-nlp-resources

A curated list of resources dedicated to Python libraries, pre-trained models, dictionaries, and corpora of NLP for Japanese

Your contributions are always welcome! Please read the Contribution guidelines before contributing.

Contents

  • Python library
    • Morphology analysis
    • Parsing
    • Converter
    • Preprocessor
    • Sentence splitter
    • Sentiment analysis
    • Machine translation
    • Named entity recognition
    • OCR
    • Tool for pretrained models
    • Others
  • Rust crate
    • Morphology analysis
    • Converter
    • Search engine library
  • Pretrained model
    • Word2Vec
    • Transformer based models
  • Dictionary
  • Corpus
  • Tutorial
  • Research summary
  • Reference
  • Contributors

Python library

Morphology analysis

  • sudachi.rs - SudachiPy 0.6* and above are developed as Sudachi.rs.
  • Janome - Japanese morphological analysis engine written in pure Python
  • mecab-python3 - mecab-python. You can find the original version here: http://taku910.github.io/mecab/
  • mecab - This repository is for building Windows 64-bit MeCab binary and improving MeCab Python binding.
  • fugashi - A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
  • nagisa - A Japanese tokenizer based on recurrent neural networks
  • pyknp - A Python Module for JUMAN++/KNP
  • Mykytea-python - Python wrapper for KyTea
  • konoha - Konoha: Simple wrapper of Japanese Tokenizers
  • natto-py - natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
  • rakutenma-python - Rakuten MA (Python version)
  • python-vaporetto - Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
  • dango - An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
  • rhoknp - Yet another Python binding for Juman++/KNP

Parsing

  • ginza - A Japanese NLP Library using spaCy as framework based on Universal Dependencies
  • cabocha - Yet Another Japanese Dependency Structure Analyzer
  • UniDic2UD - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese
  • camphr - Camphr: NLP library for creating pipeline components
  • SuPar-UniDic - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models
  • depccg - A* CCG Parser with a Supertag and Dependency Factored Model
  • bertknp - A Japanese dependency parser based on BERT
  • esupar - Tokenizer POS-Tagger and Dependency-parser with BERT/RoBERTa/DeBERTa models for Japanese and other languages

Converter

  • pykakasi - Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
  • cutlet - Japanese to romaji converter in Python
  • alphabet2kana - Convert English alphabet to Katakana
  • Convert-Numbers-to-Japanese - Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
  • mozcpy - Mozc for Python: Kana-Kanji converter
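
As a rough illustration of the numeral conversion task that Convert-Numbers-to-Japanese addresses, the sketch below renders a non-negative integer in kanji numerals using only the standard library. The names `to_kanji` and `_four_digits_to_kanji` are hypothetical and do not reproduce the API of any library listed here. The sketch assumes the common convention of omitting 一 before 十, 百, and 千.

```python
DIGITS = "零一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]   # units inside one four-digit group
LARGE_UNITS = ["", "万", "億", "兆"]   # units for successive four-digit groups


def _four_digits_to_kanji(n):
    """Convert 0 <= n < 10000 to kanji, omitting 一 before 十, 百, and 千."""
    parts = []
    for power in range(3, -1, -1):
        digit = (n // 10 ** power) % 10
        if digit == 0:
            continue
        if digit == 1 and power > 0:
            parts.append(SMALL_UNITS[power])
        else:
            parts.append(DIGITS[digit] + SMALL_UNITS[power])
    return "".join(parts)


def to_kanji(n):
    """Convert an integer in [0, 10**16) to kanji numerals (hypothetical helper)."""
    if n < 0 or n >= 10 ** 16:
        raise ValueError("only integers in [0, 10**16) are supported")
    if n == 0:
        return DIGITS[0]
    groups = []
    group_index = 0
    while n > 0:
        # Japanese groups digits by 10,000 (万, 億, 兆), not by 1,000.
        n, chunk = divmod(n, 10000)
        if chunk:
            groups.append(_four_digits_to_kanji(chunk) + LARGE_UNITS[group_index])
        group_index += 1
    return "".join(reversed(groups))


if __name__ == "__main__":
    print(to_kanji(2023))       # 二千二十三
    print(to_kanji(123456789))  # 一億二千三百四十五万六千七百八十九
```

The grouping by 10,000 is the main design point: each four-digit chunk is converted independently and then suffixed with its large unit.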

Preprocessor

  • neologdn - Japanese text normalizer for mecab-neologd
  • jaconv - Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
  • mojimoji - A fast converter between Japanese hankaku and zenkaku characters
  • text-cleaning - A powerful text cleaner for Japanese web texts
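
To make the hankaku/zenkaku and hiragana/katakana conversions mentioned above concrete, here is a minimal standard-library sketch. It is not the API of jaconv, mojimoji, or neologdn; the function names are hypothetical. It relies on Unicode NFKC normalization, which also folds other compatibility characters, so it is broader than a pure width converter.

```python
import unicodedata

# Offset between the hiragana and katakana blocks (ぁ U+3041 -> ァ U+30A1).
KANA_OFFSET = ord("ァ") - ord("ぁ")


def normalize_width(text):
    """Fold half-width katakana to full-width and full-width ASCII to half-width.

    NFKC also rewrites other compatibility characters, which may be more
    than a given pipeline wants.
    """
    return unicodedata.normalize("NFKC", text)


def hiragana_to_katakana(text):
    """Shift every hiragana code point (ぁ..ゖ) to its katakana counterpart."""
    return "".join(
        chr(ord(char) + KANA_OFFSET) if "ぁ" <= char <= "ゖ" else char
        for char in text
    )


if __name__ == "__main__":
    print(normalize_width("ﾊﾝｶｸｶﾅ ＡＢＣ１２３"))  # ハンカクカナ ABC123
    print(hiragana_to_katakana("ひらがな"))       # ヒラガナ
```

Running normalization before tokenization keeps dictionary lookups consistent, since the same word can otherwise appear in several width variants.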

Sentence splitter

  • Bunkai - Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
  • japanese-sentence-breaker - Japanese Sentence Breaker
  • sengiri - Yet another sentence-level tokenizer for the Japanese text
  • budoux - Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool.
  • ja_sentence_segmenter - Japanese sentence segmentation library for Python
  • hasami - A tool to perform sentence segmentation on Japanese text
  • kuzukiri - Japanese Text Segmenter for Python written in Rust
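
For readers new to the task, the following is a naive rule-based baseline for Japanese sentence splitting, written with the standard library only. It is a sketch under simple assumptions (split after 。！？ unless inside brackets or quotes) and is not how any of the tools above work internally; `split_sentences` is a hypothetical name.

```python
TERMINATORS = "。！？!?"
OPENERS = "「『（("
CLOSERS = "」』）)"


def split_sentences(text):
    """Split text after sentence-final punctuation, ignoring quoted spans."""
    sentences = []
    current = ""
    depth = 0  # nesting depth of quotes and brackets
    for index, char in enumerate(text):
        current += char
        if char in OPENERS:
            depth += 1
        elif char in CLOSERS:
            depth = max(depth - 1, 0)
        elif char in TERMINATORS and depth == 0:
            following = text[index + 1:index + 2]
            # Keep runs such as "！？" together in one sentence.
            if following == "" or following not in TERMINATORS:
                sentences.append(current.strip())
                current = ""
    if current.strip():
        sentences.append(current.strip())
    return sentences


if __name__ == "__main__":
    sample = "今日は晴れです。彼は「行こう。早く！」と言った。本当ですか！？"
    for sentence in split_sentences(sample):
        print(sentence)
```

A baseline like this fails on unbalanced quotes, emoticons, and text without punctuation, which is the kind of input the dedicated tools are designed for.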

Sentiment analysis

  • oseti - Dictionary based Sentiment Analysis for Japanese
  • negapoji - Japanese negative/positive classification. Judges whether a Japanese document is negative or positive.
  • pymlask - Emotion analyzer for Japanese text
  • asari - Japanese sentiment analyzer implemented in Python.

Machine translation

  • jparacrawl-finetune - An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
  • JASS - JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020) & Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (ACM TALLIP)
  • PheMT - A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena; Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
  • VISA - An ambiguous subtitles dataset for visual scene-aware machine translation

Named entity recognition

  • namaco - Character Based Named Entity Recognition.
  • entitypedia - Entitypedia is an Extended Named Entity Dictionary from Wikipedia.
  • noyaki - Converts character span label information to tokenized text-based label information.
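  • bert-japanese-ner-finetuning - Code to perform finetuning of the BERT model. A sample that creates and uses a named entity recognition model by fine-tuning BERT.
  • joint-information-extraction-hs - Inference code for named entity and relation extraction from a case-report corpus annotated under detailed annotation criteria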
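
The conversion that noyaki is described as performing, from character-span labels to token-level labels, can be sketched as follows. This is an illustrative standard-library version, not noyaki's API: the function name `spans_to_bio`, the BIO tagging scheme, and the requirement that the tokens concatenate exactly to the original text are all assumptions made for this example.

```python
def spans_to_bio(tokens, spans):
    """Map (start, end, label) character spans onto tokens as BIO tags.

    tokens: list of strings whose concatenation equals the original text.
    spans:  list of (start, end, label) with end exclusive.
    """
    # Compute the character range covered by each token.
    ranges = []
    offset = 0
    for token in tokens:
        ranges.append((offset, offset + len(token)))
        offset += len(token)

    labels = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for index, (token_start, token_end) in enumerate(ranges):
            # A token belongs to the entity if its range overlaps the span.
            if token_start < end and token_end > start:
                labels[index] = ("I-" if inside else "B-") + label
                inside = True
    return labels


if __name__ == "__main__":
    tokens = ["田中", "さん", "は", "東京", "都", "に", "住ん", "で", "いる"]
    spans = [(0, 2, "PERSON"), (5, 8, "LOCATION")]
    for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
        print(token, tag)
```

Using overlap rather than exact boundary matching means a span that cuts through a token still labels that token, which is one possible policy when the tokenizer and the annotation disagree.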

OCR

  • Manga OCR - Optical character recognition for Japanese text, with the main focus being Japanese manga
  • mokuro - Read Japanese manga inside browser with selectable text.
  • handwritten-japanese-ocr - Handwritten Japanese OCR demo using touch panel to draw the input text using Intel OpenVINO toolkit
  • OCR_Japanease - Japanese OCR
  • ndlocr_cli - NDLOCR application
  • donut - Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
  • JMTrans - Manga translator that fetches Japanese manga from a URL and translates the manga images
  • Kindai-OCR - OCR system for recognizing modern Japanese magazines

Tool for pretrained models

  • JGLUE
  • ginza-transformers
  • t5_japanese_dialogue_generation
  • japanese_text_classification
  • Japanese-BERT-Sentiment-Analyzer
  • jmlm_scoring
  • allennlp-shiba-model
  • evaluate_japanese_w2v
  • gector-ja
  • Japanese-BPEEncoder
  • Japanese-BPEEncoder_V2
  • transformer-copy

Others

  • namedivider-python - A tool for dividing the Japanese full name into a family name and a given name.
  • asa-python - A curated list of resources dedicated to Python libraries of NLP for Japanese
  • python_asa - Python version of the Japanese semantic role labeling system (ASA)
  • toiro - A comparison tool of Japanese tokenizers
  • ja-timex - A rule-based analyzer that extracts and normalizes temporal expressions written in natural language
  • JapaneseTokenizers - A set of metrics for feature selection from text data
  • daaja - This repository has implementations of data augmentation for NLP for Japanese.
  • accel-brain-code - The purpose of this repository is to make prototypes as case study in the context of proof of concept(PoC) and research and development(R&D) that I have written in my website. The main research topics are Auto-Encoders in relation to the representation learning, the statistical machine learning for energy-based models, adversarial generation net…
  • kyoto-reader - A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
  • nlplot - Visualization Module for Natural Language Processing
  • rake-ja - Rapid Automatic Keyword Extraction algorithm for Japanese
  • jel - Japanese Entity Linker.
  • MedNER-J - Latest version of MedEX/J (Japanese disease name extractor)
  • zunda-python - Zunda: Japanese Enhanced Modality Analyzer client for Python.
  • AIO2_DPR_baseline - https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
  • showcase - A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
  • darts-clone-python - Darts-clone python binding
  • jrte-corpus_example - Example codes for Japanese Realistic Textual Entailment Corpus
  • desuwa - Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
  • HotPepperGourmetDialogue - Restaurant Search System through Dialogue in Japanese.
  • nlp-recipes-ja - Sample code for natural language processing in Japanese
  • Japanese_nlp_scripts - Small example scripts for working with Japanese texts in Python
  • DNorm-J - Japanese version of DNorm
  • pyknp-eventgraph - EventGraph is a development platform for high-level NLP applications in Japanese.
  • ishi - Ishi: A volition classifier for Japanese
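  • python-npylm - Unsupervised morphological analysis with a Bayesian hierarchical language model
  • python-npycrf - Semi-supervised morphological analysis that integrates conditional random fields with a Bayesian hierarchical language model
  • unsupervised-pos-tagging - Unsupervised part-of-speech tag estimation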
  • negima - Negima is a Python package to extract phrases in Japanese text by using the part-of-speeches based rules you defined.
  • YouyakuMan - Extractive summarizer using BertSum as summarization model
  • japanese-numbers-python - A parser for Japanese number (Kanji, arabic) in the natural language.
  • kantan - Lookup japanese words by radical patterns
  • make-meidai-dialogue - Get Japanese dialogue corpus
  • japanese_summarizer - A summarizer for Japanese articles.
  • chirptext - ChirpText is a collection of text processing tools for Python.
  • yubin - Japanese Address Munger
  • jawiki-cleaner - Japanese Wikipedia Cleaner
  • japanese2phoneme - A python library to convert Japanese to phoneme.
  • anlp_nlp2021_d3-1 - This repository contains codes related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification"
  • aozora_classification - This project aims to classify Japanese sentences by how similar they are to the writing of classical Japanese authors such as Soseki Natsume, Ogai Mori, and Ryunosuke Akutagawa.
  • aozora-corpus-generator - Generates plain or tokenized text files from the Aozora Bunko
  • JLM - A fast LSTM Language Model for large vocabulary language like Japanese and Chinese
  • NTM - Testing of Neural Topic Modeling for Japanese articles
  • EN-JP-ML-Lexicon - This is an English-Japanese lexicon for Machine Learning and Deep Learning terminology.
  • text-generation - Easy-to-use scripts to fine-tune GPT-2-JA with your own texts, to generate sentences, and to tweet them automatically.
  • chainer_nic - Neural Image Caption (NIC) on chainer, its pretrained models on English and Japanese image caption datasets.
  • unihan-lm - The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
  • mbart-finetuning - Code to perform finetuning of the mBART model.
  • xvector_jtubespeech - xvector model on jtubespeech
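  • TinySegmenterMaker - A tool for building your own trained models for TinySegmenter.
  • Grongish - A script for two-way conversion between Japanese and the Grongi language
  • WordCloud-Japanese - A script that renders Japanese text in WordCloud with a morphological-analysis-like display, without using MeCab (a morphological analysis engine)
  • snark - A database access library that uses Japanese WordNet
  • toEmoji - Something that converts Japanese sentences into emoji-only sentences
  • kokkosho_data - Practice implementation of a technical term extraction algorithm
  • JDT-with-KenLM-scoring - Scores the response candidates of Japanese-Dialog-Transformer with a KenLM n-gram language model, then filters or reranks them.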
  • mixture-of-unigram-model - Mixture of Unigram Model and Infinite Mixture of Unigram Model in Python. (混合ユニグラムモデルと無限混合ユニグラムモデル)
  • hidden-markov-model - Hidden Markov Model (HMM) and Infinite Hidden Markov Model (iHMM) in Python. (隠れマルコフモデルと無限隠れマルコフモデル)
  • Ngram-language-model - Ngram language model in Python. (Nグラム言語モデル)
  • ASRDeepSpeech - Automatic Speech Recognition with deepspeech2 model in pytorch with support from Zakuro AI.
  • neural_ime - Neural IME: Neural Input Method Engine
  • neural_japanese_transliterator - Can neural networks transliterate Romaji into Japanese correctly?

Rust crate

Morphology analysis

  • lindera - A morphological analysis library.
  • vaporetto - Vaporetto: Very Accelerated POintwise pREdicTion based TOkenizer
  • goya - Japanese Morphological Analysis written in Rust
  • vibrato - vibrato: Viterbi-based accelerated tokenizer
  • yoin - A Japanese Morphological Analyzer written in pure Rust

Converter

  • wana_kana_rust - Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
  • unicode-jp-rs - A Rust library to convert Japanese Half-width-kana[半角カナ] and Wide-alphanumeric[全角英数] into normal ones
  • kana - [Mirror] CLI program for transliterating romaji text to either hiragana or katakana

Search engine library

  • lindera-tantivy

Pretrained model

Word2Vec

  • japanese-words-to-vectors - Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.
  • chiVe - Japanese word embedding with Sudachi and NWJC
  • elmo-japanese - elmo-japanese
  • embedrank - Python Implementation of EmbedRank
  • japanese-words-to-vectors - Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.
  • aovec - Easy Aozora Bunko Word2Vec builder: a Word2Vec builder for all Aozora Bunko books, plus prebuilt models
  • dependency-based-japanese-word-embeddings - This is a repository for the AI LAB article "係り受けに基づく日本語単語埋込 (Dependency-based Japanese Word Embeddings)" ( Article URL https://ai-lab.lapras.com/nlp/japanese-word-embedding/)
  • jawikivec - Yet Another Japanese-Wikipedia Entity Vectors
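  • jawiki_word_vector_updater - Scripts that run MeCab morphological analysis on the latest Japanese Wikipedia dump with both the IPA dictionary and the latest NEologd dictionary, then train word2vec, fastText, and GloVe word embeddings on the results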

Transformer based models

  • bert-japanese - BERT models for Japanese text.
  • japanese-pretrained-models - Code for producing Japanese pretrained models provided by rinna Co., Ltd.
  • bert-japanese - BERT with SentencePiece for Japanese text.
  • SudachiTra - Japanese tokenizer for Transformers
  • japanese-dialog-transformers - Code for evaluating Japanese pretrained models provided by NTT Ltd.
  • shiba - Pytorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
  • Dialog - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
  • language-pretraining - BERT and ELECTRA models of PyTorch implementations for Japanese text.
  • medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
  • ILYS-aoba-chatbot - ILYS-aoba-chatbot
  • t5-japanese - Codes to pre-train Japanese T5 models
  • pytorch_bert_japanese - Use pretrained Japanese BERT models with PyTorch
  • Laboro-BERT-Japanese - Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
  • RoBERTa-japanese - Japanese BERT Pretrained Model
  • aMLP-japanese - aMLP Transformer Model for Japanese
  • bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
  • sbert-ja - Code to train Sentence BERT Japanese model for Hugging Face Model Hub
  • BERT-Japan-vaccination - Official fine-tuning code for "Emotion Analysis of Japanese Tweets and Comparison to Vaccinations in Japan"
  • gpt2-japanese - Japanese GPT2 Generation Model
  • text2text-japanese - gpt-2 based text2text conversion model
  • gpt-ja - GPT-2 Japanese model for HuggingFace's transformers
  • friendly_JA-Model - MT model trained using the friendly_JA Corpus attempting to make Japanese easier/more accessible to occidental people by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
  • albert-japanese - BERT with SentencePiece for Japanese text.
  • ja_text_bert - Repository for generating a pre-trained BERT model from the Japanese Wikipedia corpus
  • bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
  • DistilBERT-base-jp - A Japanese DistilBERT pretrained model, which was trained on Wikipedia.
  • bert - This repository provides snippets to use RoBERTa pre-trained on a Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25GB in total. The released model is built based on that from HuggingFace.
  • medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
  • Laboro-DistilBERT-Japanese - Laboro DistilBERT Japanese

Dictionary

  • mecab-ipadic-neologd - Neologism dictionary based on the language resources on the Web for mecab-ipadic
  • tdmelodic - A Japanese accent dictionary generator
  • jamdict - Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings
  • unidic-py - Unidic packaged for installation via pip.
  • Japanese-Company-Lexicon - Japanese Company Lexicon (JCLdic)
  • manbyo-sudachi - Manbyo Dictionary (万病辞書) for Sudachi
  • jawiki-kana-kanji-dict - Generate SKK/MeCab dictionary from Wikipedia(Japanese edition)
  • JIWC-Dictionary - dictionary to find emotion related to text
  • JumanDIC - This repository contains source dictionary files to build dictionaries for JUMAN and Juman++.
  • ipadic-py - IPAdic packaged for easy use from Python.
  • unidic-lite - A small version of UniDic for easy pip installs.
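  • emoji-ime-dictionary - Additional IME dictionary for typing emoji in Japanese: an IME extension dictionary that enables Japanese-to-emoji conversion in Google Japanese Input and similar IMEs
  • google-ime-dictionary - Additional IME dictionary for Japanese-to-English conversion and English abbreviation expansion: an IME extension dictionary that enables both in Google Japanese Input, ATOK, and similar IMEs
  • dic-nico-intersection-pixiv - IME dictionary built from the entries common to Niconico Pedia (ニコニコ大百科) and the pixiv encyclopedia (ピクシブ百科事典)
  • google-ime-user-dictionary-ja-en - Project archive of a Google IME user dictionary from Katakana words (Japanese loanwords) to English.
  • emoticon - Kaomoji (emoticon) dictionary for Google Japanese Input ∩(,,Ò‿Ó,,)∩
  • mecab-mozcdic - The open source mozc dictionary converted to the MeCab dictionary format.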

Corpus

  • jrte-corpus - Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
  • open2ch-dialogue-corpus - Dialogue corpus created by crawling Open2ch (おーぷん2ちゃんねる)
  • kanji-data - A JSON kanji dataset with updated JLPT levels and WaniKani information
  • JapaneseWordSimilarityDataset - Japanese Word Similarity Dataset
  • simple-jppdb - A paraphrase database for Japanese text simplification
  • TwitterCorpus - Tokyo Metropolitan University Japanese Twitter corpus
  • chABSA-dataset - chakki's Aspect-Based Sentiment Analysis dataset
  • ner-wikipedia-dataset - Japanese named entity recognition dataset built from Wikipedia
  • JaQuAD - JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)
  • JaNLI - Japanese Adversarial Natural Language Inference Dataset
  • BSD - The Business Scene Dialogue corpus
  • dataset-list - lists of text corpus and more (mainly Japanese)
  • UD_Japanese-PUD - Parallel Universal Dependencies.
  • ebe-dataset - Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
  • UD_Japanese-GSD - Japanese data from the Google UDT 2.0.
  • emoji-ja - Dictionary of Japanese readings, keywords, and categories for Unicode emoji
  • nayose-wikipedia-ja - Japanese entity name matching (nayose) dataset created from Wikipedia
  • IOB2Corpus - Japanese IOB2 tagged corpus for Named Entity Recognition.
  • ja.text8 - Japanese text8 corpus for word embedding.
  • ThreeLineSummaryDataset - Three-line summary dataset
  • japanese - This repo contains a list of the 44,998 most common Japanese words in order of frequency, as determined by the University of Leeds Corpus.
  • kanji-frequency - Kanji usage frequency data collected from various sources
  • TEDxJP-10K - TEDxJP-10K ASR Evaluation Dataset
  • CoARiJ - Corpus of Annual Reports in Japan
  • small_parallel_enja - 50k English-Japanese Parallel Corpus for Machine Translation Benchmark.
  • KWDLC - Kyoto University Web Document Leads Corpus
  • AnnotatedFKCCorpus - Annotated Fuman Kaitori Center Corpus
  • technological-book-corpus-ja - Raw corpus and tools that collect technical books written in Japanese
  • ita-corpus-chuwa - Chunked word annotation for ITA corpus
  • asdc - Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
  • wikipedia-utils - Utility scripts for preprocessing Wikipedia texts for NLP
  • Web-Crawled-Corpus-for-Japanese-Chinese-NMT - A Web Crawled Corpus for Japanese-Chinese NMT
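  • inappropriate-words-ja - Collects inappropriate expressions in Japanese. Intended for uses such as data cleaning in natural language processing.
  • house-of-councillors - Organized data on parliamentary groups, members, bills, and written questions from the official website of the House of Councillors.
  • house-of-representatives - Diet bill database: House of Representatives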
  • STAIR-captions - STAIR captions: large-scale Japanese image caption dataset
  • Winograd-Schema-Challenge-Ja - Japanese Translation of Winograd Schema Challenge
  • speechBSD - An extension of the BSD corpus with audio and speaker attribute information
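  • ita-corpus - List of sentences in the ITA corpus
  • rohan4600 - Mora-balanced Japanese corpus
  • anlp-jp-history - Full list of presentations at the annual meetings of the Association for Natural Language Processing, including a machine-readable version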
  • JMRD - Japanese Movie Recommendation Dialogue dataset
  • keigo_transfer_task - Evaluation dataset for the keigo (honorific speech) conversion task
  • CourseraParallelCorpusMining - Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
  • JESC - A large parallel corpus of English and Japanese
  • loanwords_gairaigo - English loanwords in Japanese
  • jawikicorpus - Japanese-Wikipedia Wikification Corpus
  • GeneralPolicySpeechOfPrimeMinisterOfJapan - A corpus of Japanese text from the general policy speeches of the Prime Minister of Japan
  • AMI-Meeting-Parallel-Corpus - AMI Meeting Parallel Corpus
  • giant_ja-en_parallel_corpus - This directory includes a giant Japanese-English subtitle corpus. The raw data comes from Stanford's JESC project.
  • japanese-corpus - Japanese dialogue data for seq2seq, etc.
  • jesc_small - Small Japanese-English Subtitle Corpus
  • wrime - WRIME: a dataset for subjective and objective emotion analysis
  • jtubespeech - JTubeSpeech: Corpus of Japanese speech collected from YouTube
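  • WikipediaWordFrequencyList - List of frequently used words in Japanese Wikipedia
  • kokkosho_data - Dataset on vehicle defect information
  • pdmocrdataset-part1 - OCR training dataset created in the project for OCR text conversion of digitized materials
  • huriganacorpus-ndlbib - Furigana dataset created from national bibliography data
  • jvs_hiho - Self-made labels for the JVS (Japanese versatile speech) corpus
  • graded-enja-corpus - A Japanese-English parallel corpus that takes prohibited terms and word levels into account.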
  • cjk-compsci-terms - CJK computer science terms comparison / 中日韓電腦科學術語對照 / 日中韓のコンピュータ科学の用語対照 / 한·중·일 전산학 용어 대조
  • Laboro-ParaCorpus - Scripts for creating a Japanese-English parallel corpus and training NMT models

Tutorial

  • spacy_tutorial
  • fastTextJapaneseTutorial
  • allennlp-NER-ja
  • chariot-PyTorch-Japanese-text-classification
  • ginza-examples
  • DocumentClassificationUsingBERT-Japanese
  • BERT_Japanese_Google_Colaboratory
  • bert-book
  • janome-tutorial
  • handson-language-models

Research summary

  • awesome-bert-japanese - A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
  • GEC-Info-ja - Repository for collecting and classifying Japanese-language literature on grammatical error correction

Reference

Contributors