Anoop Kunchukuttan issues

Results: 93 issues of Anoop Kunchukuttan

COVID19 translations

https://github.com/neulab/covid19-datashare

wontfix

Add different Standards

- ITRANS, IAST, WX, BrahmiNet and other romanization standards
- IPA-Indic scripts standard from IIT Madras
- BIS POS tag set

Add iNLTK datasets

https://www.kaggle.com/disisbig/datasets

Supara English-Bengali parallel corpus

SUPARA0.8M: A Balanced English-Bangla Parallel Corpus
https://ieee-dataport.org/documents/supara08m-balanced-english-bangla-parallel-corpus
Around 20k sentences.

wontfix

SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 14 Languages

[Semantic Relatedness dataset](https://arxiv.org/abs/2402.08638)
4 Indic languages: hin, mar, pan, tel. 300-1000 sentence pairs in the test set.

NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation

Paper: https://arxiv.org/abs/2312.11361
Repo: https://github.com/project-miracl/nomiracl
Languages: bn, hi, te
* A test set to evaluate whether a paragraph contains an answer to a query
* Use for evaluating hallucinations and error rates in...

PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

https://arxiv.org/abs/2305.08828
14 languages, for testing.

L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Classification datasets for 10 Indian languages based on news articles. More challenging than existing sets. [Paper](https://arxiv.org/abs/2401.02254) [Repository](https://github.com/l3cube-pune/indic-nlp)

L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Paper: https://arxiv.org/abs/2401.00170

Chandamama Kathalu

Chandamama Dataset and Small LLM for Telugu (upcoming)
Announcement: https://www.linkedin.com/feed/update/urn:li:activity:7147934433545773056/
Dataset: https://huggingface.co/datasets/swechatelangana/chandamama-kathalu