
State-of-the-art language models and classifier for Sanskrit (an ancient Indian language)

NLP for Sanskrit

This repository contains state-of-the-art language models and a classifier for Sanskrit, an ancient Indian language.

The models trained here are used in the Natural Language Toolkit for Indic Languages (iNLTK).

Dataset

The following datasets were created as part of this project:

  1. Sanskrit Wikipedia Articles

  2. Sanskrit Shlokas Dataset

Results

Language Model Perplexity

Architecture/Dataset | Sanskrit Wikipedia Articles
ULMFiT               | ~6
TransformerXL        | ~3
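
For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean per-token negative log-likelihood on held-out text. The sketch below illustrates that definition only; it is not the evaluation code used in this repository, and the function name and sample losses are hypothetical.

```python
import math


def perplexity(token_nll):
    """Return exp(mean negative log-likelihood), using natural logs."""
    if not token_nll:
        raise ValueError("need at least one token loss")
    mean_nll = sum(token_nll) / len(token_nll)
    return math.exp(mean_nll)


# Hypothetical per-token losses, NOT taken from the models in this repository.
example_nll = [math.log(4.0), math.log(9.0)]
print(perplexity(example_nll))  # geometric mean of 4 and 9, i.e. about 6
```

Read this way, a perplexity of about 6 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 6 tokens at each step. Perplexity depends on the vocabulary, so values are directly comparable only when the models share a tokenizer.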

Classification Metrics

ULMFiT
Dataset                  | Accuracy | Kappa Score
Sanskrit Shlokas Dataset | 84.3     | 76.1
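
As a reminder of what the two columns measure, accuracy is the fraction of correct predictions, while Cohen's kappa corrects that agreement for what chance alone would produce: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal standard-library illustration with hypothetical labels. It is not the repository's evaluation code, and it returns values in [0, 1] scale, whereas the table appears to report them multiplied by 100 (an assumption; the source does not state the scale).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability both pick the same class independently.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when only one class ever occurs
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for illustration only.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "a", "b", "a"]
print(accuracy(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

Kappa is lower than accuracy whenever some agreement would be expected by chance, which is why the two numbers in the table differ.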

Visualizations

Embedding Space
Architecture  | Visualization
ULMFiT        | Embeddings projection
TransformerXL | Embeddings projection

Pretrained Language Model

Download pretrained Language Model from here

Classifier

Download classifier from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here
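
Once the model file is downloaded, it can presumably be loaded with the `sentencepiece` Python package. The following is a minimal sketch under that assumption: the filename `tokenizer.model` and the helper `tokenize` are hypothetical, so substitute the actual name of the downloaded file.

```python
import os

# Hypothetical filename; replace with the path of the downloaded model file.
MODEL_PATH = "tokenizer.model"


def tokenize(text, model_path=MODEL_PATH):
    """Split text into subword pieces using a trained SentencePiece model."""
    import sentencepiece as spm  # requires: pip install sentencepiece

    sp = spm.SentencePieceProcessor(model_file=model_path)
    return sp.encode(text, out_type=str)


# Runs only if the model file is present in the working directory.
if os.path.exists(MODEL_PATH):
    print(tokenize("संस्कृतम् एका प्राचीना भाषा अस्ति"))
```

Passing `out_type=int` instead of `out_type=str` returns vocabulary ids rather than subword strings, which is the form a language model typically consumes.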