awesome-japanese-nlp-resources
![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)
A curated list of Python libraries, Rust crates, pretrained models, dictionaries, and corpora for Japanese NLP.
Your contributions are always welcome!
Please read the Contribution guidelines before contributing.
Contents
-
Python library
-
Morphology analysis
-
Parsing
-
Converter
-
Preprocessor
-
Sentence splitter
-
Sentiment analysis
-
Machine translation
-
Named entity recognition
-
OCR
-
Tool for pretrained models
-
Others
-
Rust crate
-
Morphology analysis
-
Converter
-
Search engine library
-
Pretrained model
-
Word2Vec
-
Transformer based models
-
Dictionary
-
Corpus
-
Tutorial
-
Research summary
-
Reference
-
Contributors
Python library
Morphology analysis
-
sudachi.rs - SudachiPy 0.6* and above are developed as Sudachi.rs.
-
Janome - Japanese morphological analysis engine written in pure Python
-
mecab-python3 - mecab-python. You can find the original version here: http://taku910.github.io/mecab/
-
mecab - This repository is for building a Windows 64-bit MeCab binary and improving the MeCab Python binding.
-
fugashi - A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
-
nagisa - A Japanese tokenizer based on recurrent neural networks
-
pyknp - A Python Module for JUMAN++/KNP
-
Mykytea-python - Python wrapper for KyTea
-
konoha - Konoha: Simple wrapper of Japanese Tokenizers
-
natto-py - natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
-
rakutenma-python - Rakuten MA (Python version)
-
python-vaporetto - Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
-
dango - An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
-
rhoknp - Yet another Python binding for Juman++/KNP
Parsing
-
ginza - A Japanese NLP Library using spaCy as framework based on Universal Dependencies
-
cabocha - Yet Another Japanese Dependency Structure Analyzer
-
UniDic2UD - Tokenizer, POS-tagger, Lemmatizer, and Dependency-parser for modern and contemporary Japanese
-
camphr - NLP library for creating pipeline components
-
SuPar-UniDic - Tokenizer, POS-tagger, Lemmatizer, and Dependency-parser for modern and contemporary Japanese with BERT models
-
depccg - A* CCG Parser with a Supertag and Dependency Factored Model
-
bertknp - A Japanese dependency parser based on BERT
-
esupar - Tokenizer, POS-Tagger, and Dependency-parser with BERT/RoBERTa/DeBERTa models for Japanese and other languages
Converter
-
pykakasi - Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
-
cutlet - Japanese to romaji converter in Python
-
alphabet2kana - Convert English alphabet to Katakana
-
Convert-Numbers-to-Japanese - Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
-
mozcpy - Mozc for Python: Kana-Kanji converter
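
To illustrate the kind of script conversion the tools in this category perform, here is a minimal sketch using only the Python standard library. It is not the implementation of any library listed above; the function name `hira_to_kata` is hypothetical. It relies on the fact that hiragana (U+3041 to U+3096) and katakana (U+30A1 to U+30F6) are laid out in parallel in Unicode, 0x60 code points apart.

```python
# Hypothetical helper, for illustration only: hiragana -> katakana
# conversion by shifting code points. Real converters also handle
# romaji, kanji readings, and many edge cases not covered here.
HIRAGANA_START = 0x3041  # ぁ
HIRAGANA_END = 0x3096    # ゖ
KANA_OFFSET = 0x60       # distance between the hiragana and katakana blocks


def hira_to_kata(text: str) -> str:
    """Return text with every hiragana character replaced by its katakana counterpart."""
    converted = []
    for ch in text:
        code = ord(ch)
        if HIRAGANA_START <= code <= HIRAGANA_END:
            converted.append(chr(code + KANA_OFFSET))
        else:
            # Kanji, katakana, Latin letters, and punctuation pass through unchanged.
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    print(hira_to_kata("ひらがな"))
```

Converting kanji to kana or romaji requires a dictionary of readings, which is where the dedicated libraries above come in.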
Preprocessor
-
neologdn - Japanese text normalizer for mecab-neologd
-
jaconv - Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
-
mojimoji - A fast converter between Japanese hankaku and zenkaku characters
-
text-cleaning - A powerful text cleaner for Japanese web texts
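
As a rough illustration of what hankaku/zenkaku normalization means, the sketch below uses Unicode NFKC normalization from the Python standard library. This is an assumption-laden stand-in, not the behavior of any library listed above: those tools apply their own, more selective rules. The function name `normalize_width` is hypothetical.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width alphanumerics to ASCII and half-width katakana to full-width.

    NFKC is a blunt instrument: it also rewrites other compatibility
    characters (for example, circled digits), which may or may not be
    what a given preprocessing pipeline wants.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(normalize_width("ＡＢＣ１２３"))
    print(normalize_width("ﾊﾝｶｸ"))
```

A dedicated normalizer is the better choice when only some character classes should be converted.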
Sentence splitter
-
Bunkai - Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
-
japanese-sentence-breaker - Japanese Sentence Breaker
-
sengiri - Yet another sentence-level tokenizer for the Japanese text
-
budoux - Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool.
-
ja_sentence_segmenter - Japanese sentence segmentation library for Python
-
hasami - A tool to perform sentence segmentation on Japanese text
-
kuzukiri - Japanese Text Segmenter for Python written in Rust
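
For orientation, a naive baseline for Japanese sentence splitting can be sketched with a regular expression that cuts after sentence-final punctuation. This is only a sketch under simple assumptions and is not how any tool listed above works; the function name `split_sentences` is hypothetical, and terminators inside quotations or parentheses are not treated specially.

```python
import re

# One sentence = a run of non-terminators followed by any run of terminators.
_SENTENCE_RE = re.compile(r"[^。！？]+[。！？]*")


def split_sentences(text: str) -> list[str]:
    """Split text after each run of 。, ！, or ？ and drop empty pieces."""
    pieces = (match.group().strip() for match in _SENTENCE_RE.finditer(text))
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    for sentence in split_sentences("今日は晴れです。明日は雨でしょう！本当？"):
        print(sentence)
```

Text without clear sentence-final punctuation, or with punctuation nested inside quoted speech, is where this baseline breaks down.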
Sentiment analysis
-
oseti - Dictionary based Sentiment Analysis for Japanese
-
negapoji - Japanese negative/positive classification. Judges whether a Japanese document is negative or positive.
-
pymlask - Emotion analyzer for Japanese text
-
asari - Japanese sentiment analyzer implemented in Python.
Machine translation
-
jparacrawl-finetune - An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
-
JASS - JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020) & Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (ACM TALLIP)
-
PheMT - A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena: Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
-
VISA - An ambiguous subtitles dataset for visual scene-aware machine translation
Named entity recognition
-
namaco - Character Based Named Entity Recognition.
-
entitypedia - Entitypedia is an Extended Named Entity Dictionary from Wikipedia.
-
noyaki - Converts character span label information to tokenized text-based label information.
-
bert-japanese-ner-finetuning - Code to perform finetuning of the BERT model. A sample that creates and uses a named entity recognition model by fine-tuning BERT.
-
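joint-information-extraction-hs - Code for running inference to measure the extraction accuracy of named entities and relations from a case report corpus built on detailed annotation criteria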
OCR
-
Manga OCR - Optical character recognition for Japanese text, with the main focus being Japanese manga
-
mokuro - Read Japanese manga inside browser with selectable text.
-
handwritten-japanese-ocr - Handwritten Japanese OCR demo using touch panel to draw the input text using Intel OpenVINO toolkit
-
OCR_Japanease - Japanese OCR
-
ndlocr_cli - NDLOCR application
-
donut - Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
-
JMTrans - Manga translator: fetches Japanese manga from a URL and translates the manga images
-
Kindai-OCR - OCR system for recognizing modern Japanese magazines
Tool for pretrained models
Others
-
namedivider-python - A tool for dividing the Japanese full name into a family name and a given name.
-
asa-python - A curated list of resources dedicated to Python libraries of NLP for Japanese
-
python_asa - Python version of the Japanese semantic role labeling system (ASA)
-
toiro - A comparison tool of Japanese tokenizers
-
ja-timex - A rule-based analyzer that extracts and normalizes temporal expressions written in natural language
-
JapaneseTokenizers - A set of metrics for feature selection from text data
-
daaja - This repository has implementations of data augmentation for NLP for Japanese.
-
accel-brain-code - The purpose of this repository is to make prototypes as case study in the context of proof of concept(PoC) and research and development(R&D) that I have written in my website. The main research topics are Auto-Encoders in relation to the representation learning, the statistical machine learning for energy-based models, adversarial generation net…
-
kyoto-reader - A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
-
nlplot - Visualization Module for Natural Language Processing
-
rake-ja - Rapid Automatic Keyword Extraction algorithm for Japanese
-
jel - Japanese Entity Linker.
-
MedNER-J - Latest version of MedEX/J (Japanese disease name extractor)
-
zunda-python - Zunda: Japanese Enhanced Modality Analyzer client for Python.
-
AIO2_DPR_baseline - https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
-
showcase - A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
-
darts-clone-python - Darts-clone python binding
-
jrte-corpus_example - Example codes for Japanese Realistic Textual Entailment Corpus
-
desuwa - Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
-
HotPepperGourmetDialogue - Restaurant Search System through Dialogue in Japanese.
-
nlp-recipes-ja - Sample code for natural language processing in Japanese
-
Japanese_nlp_scripts - Small example scripts for working with Japanese texts in Python
-
DNorm-J - Japanese version of DNorm
-
pyknp-eventgraph - EventGraph is a development platform for high-level NLP applications in Japanese.
-
ishi - Ishi: A volition classifier for Japanese
-
python-npylm - Unsupervised morphological analysis with a Bayesian hierarchical language model
-
python-npycrf - Semi-supervised morphological analysis that integrates conditional random fields with a Bayesian hierarchical language model
-
unsupervised-pos-tagging - Unsupervised part-of-speech tag estimation
-
negima - Negima is a Python package to extract phrases from Japanese text using part-of-speech-based rules that you define.
-
YouyakuMan - Extractive summarizer using BertSum as summarization model
-
japanese-numbers-python - A parser for Japanese numbers (kanji, Arabic) in natural language.
-
kantan - Look up Japanese words by radical patterns
-
make-meidai-dialogue - Get Japanese dialogue corpus
-
japanese_summarizer - A summarizer for Japanese articles.
-
chirptext - ChirpText is a collection of text processing tools for Python.
-
yubin - Japanese Address Munger
-
jawiki-cleaner - Japanese Wikipedia Cleaner
-
japanese2phoneme - A Python library to convert Japanese to phonemes.
-
anlp_nlp2021_d3-1 - This repository contains code related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification"
-
aozora_classification - This project aims to classify Japanese sentences by how similar they are to the writing of classical Japanese writers such as Soseki Natsume, Ogai Mori, and Ryunosuke Akutagawa.
-
aozora-corpus-generator - Generates plain or tokenized text files from the Aozora Bunko
-
JLM - A fast LSTM Language Model for large-vocabulary languages like Japanese and Chinese
-
NTM - Testing of Neural Topic Modeling for Japanese articles
-
EN-JP-ML-Lexicon - An English-Japanese lexicon for Machine Learning and Deep Learning terminology.
-
text-generation - Easy-to-use scripts to fine-tune GPT-2-JA with your own texts, to generate sentences, and to tweet them automatically.
-
chainer_nic - Neural Image Caption (NIC) on Chainer, with pretrained models for English and Japanese image caption datasets.
-
unihan-lm - The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
-
mbart-finetuning - Code to perform finetuning of the mBART model.
-
xvector_jtubespeech - xvector model on jtubespeech
-
TinySegmenterMaker - A tool for building your own trained models for TinySegmenter.
-
Grongish - A script for converting between Japanese and the Grongi language (グロンギ語)
-
WordCloud-Japanese - A script that gives Japanese text in WordCloud a morphological-analysis-like display without using MeCab (a morphological analysis engine)
-
snark - A DB access library that uses Japanese WordNet
-
toEmoji - Something that converts Japanese sentences into emoji-only sentences
-
kokkosho_data - Practice implementation of a technical term extraction algorithm
-
JDT-with-KenLM-scoring - Scores response candidates from Japanese-Dialog-Transformer with a KenLM N-gram language model, then filters or reranks them.
-
mixture-of-unigram-model - Mixture of Unigram Model and Infinite Mixture of Unigram Model in Python. (混合ユニグラムモデルと無限混合ユニグラムモデル)
-
hidden-markov-model - Hidden Markov Model (HMM) and Infinite Hidden Markov Model (iHMM) in Python. (隠れマルコフモデルと無限隠れマルコフモデル)
-
Ngram-language-model - Ngram language model in Python. (Nグラム言語モデル)
-
ASRDeepSpeech - Automatic Speech Recognition with deepspeech2 model in pytorch with support from Zakuro AI.
-
neural_ime - Neural IME: Neural Input Method Engine
-
neural_japanese_transliterator - Can neural networks transliterate Romaji into Japanese correctly?
Rust crate
Morphology analysis
-
lindera - A morphological analysis library.
-
vaporetto - Vaporetto: Very Accelerated POintwise pREdicTion based TOkenizer
-
goya - Japanese Morphological Analysis written in Rust
-
vibrato - vibrato: Viterbi-based accelerated tokenizer
-
yoin - A Japanese Morphological Analyzer written in pure Rust
Converter
-
wana_kana_rust - Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
-
unicode-jp-rs - A Rust library to convert Japanese Half-width-kana[半角カナ] and Wide-alphanumeric[全角英数] into normal ones
-
kana - [Mirror] CLI program for transliterating romaji text to either hiragana or katakana
Search engine library
Pretrained model
Word2Vec
-
japanese-words-to-vectors - Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.
-
chiVe - Japanese word embedding with Sudachi and NWJC
-
elmo-japanese - elmo-japanese
-
embedrank - Python Implementation of EmbedRank
-
aovec - Easy Aozora Bunko Word2Vec builder: builds Word2Vec from all Aozora Bunko books, with prebuilt models included
-
dependency-based-japanese-word-embeddings - This is a repository for the AI LAB article "係り受けに基づく日本語単語埋込 (Dependency-based Japanese Word Embeddings)" (article URL: https://ai-lab.lapras.com/nlp/japanese-word-embedding/)
-
jawikivec - Yet Another Japanese-Wikipedia Entity Vectors
-
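jawiki_word_vector_updater - Scripts that run MeCab morphological analysis on the latest Japanese Wikipedia dump with both the IPA dictionary and the latest NEologd dictionary, then train word2vec, fastText, and GloVe word embeddings on the results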
Transformer based models
-
bert-japanese - BERT models for Japanese text.
-
japanese-pretrained-models - Code for producing Japanese pretrained models provided by rinna Co., Ltd.
-
bert-japanese - BERT with SentencePiece for Japanese text.
-
SudachiTra - Japanese tokenizer for Transformers
-
japanese-dialog-transformers - Code for evaluating Japanese pretrained models provided by NTT Ltd.
-
shiba - Pytorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
-
Dialog - A PyTorch implementation of a Japanese chatbot using BERT and a Transformer decoder
-
language-pretraining - BERT and ELECTRA models of PyTorch implementations for Japanese text.
-
medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
-
ILYS-aoba-chatbot - ILYS-aoba-chatbot
-
t5-japanese - Code to pre-train Japanese T5 models
-
pytorch_bert_japanese - Use Japanese pretrained BERT models with PyTorch
-
Laboro-BERT-Japanese - Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
-
RoBERTa-japanese - Japanese BERT Pretrained Model
-
aMLP-japanese - aMLP Transformer Model for Japanese
-
bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
-
sbert-ja - Code to train Sentence BERT Japanese model for Hugging Face Model Hub
-
BERT-Japan-vaccination - Official fine-tuning code for "Emotion Analysis of Japanese Tweets and Comparison to Vaccinations in Japan"
-
gpt2-japanese - Japanese GPT2 Generation Model
-
text2text-japanese - gpt-2 based text2text conversion model
-
gpt-ja - GPT-2 Japanese model for HuggingFace's transformers
-
friendly_JA-Model - MT model trained using the friendly_JA Corpus attempting to make Japanese easier/more accessible to occidental people by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
-
albert-japanese - BERT with SentencePiece for Japanese text.
-
ja_text_bert - Repository for generating pretrained BERT models from a Japanese Wikipedia corpus
-
DistilBERT-base-jp - A Japanese DistilBERT pretrained model, which was trained on Wikipedia.
-
bert - This repository provides snippets to use RoBERTa pre-trained on a Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25GB in total. The released model is built based on that from HuggingFace.
-
Laboro-DistilBERT-Japanese - Laboro DistilBERT Japanese
Dictionary
Corpus
Tutorial
Research summary
-
awesome-bert-japanese - A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
-
GEC-Info-ja - Repository for collecting and classifying Japanese-language literature on grammatical error correction
Reference
Contributors