Anoop Kunchukuttan
- ITRANS, IAST, WX, BrahmiNet and other romanization standards.
- IPA-Indic scripts standard from IIT Madras
- BIS POS tag set
https://www.kaggle.com/disisbig/datasets
SUPara0.8M: A Balanced English-Bangla Parallel Corpus. https://ieee-dataport.org/documents/supara08m-balanced-english-bangla-parallel-corpus (around 20k sentences)
[Semantic Relatedness dataset](https://arxiv.org/abs/2402.08638) 4 Indic languages: hin, mar, pan, tel. 300-1000 sentence pairs in the test set.
* Paper: https://arxiv.org/abs/2312.11361
* Repo: https://github.com/project-miracl/nomiracl
* Languages: bn, hi, te
* A test set to evaluate whether a paragraph contains an answer to a query
* Use for evaluating hallucinations and error-rates in...
https://arxiv.org/abs/2305.08828 14 languages, for testing
L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages
Classification datasets for 10 Indian languages based on news articles. More challenging than existing sets. [Paper](https://arxiv.org/abs/2401.02254) [Repository](https://github.com/l3cube-pune/indic-nlp)
Paper: https://arxiv.org/abs/2401.00170
Chandamama Dataset and Small LLM for Telugu (upcoming) https://www.linkedin.com/feed/update/urn:li:activity:7147934433545773056/ Dataset: https://huggingface.co/datasets/swechatelangana/chandamama-kathalu