
The project scrapes articles from a Malayalam newspaper website to create a corpus. A set of queries is created and the corresponding ground-truth answers are retrieved. This can be used as a dataset that c...

Malayalam-Newspaper-Article-Dataset

The project scraped articles from a Malayalam newspaper (Janmabhumi) website to create a corpus of news articles. A set of queries was also created, and the corresponding ground-truth answers were retrieved using a combination of the BM25 and TF-IDF methods. The dataset can be useful for building tools such as stemmers, stopword removers, and lemmatizers.
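The README does not say how the BM25 and TF-IDF results were combined, so the following is only a minimal sketch of one plausible approach: score every article with both methods, min-max normalize each score list, and rank by the average. The function names, the whitespace tokenizer, the averaging rule, and the `k1`/`b` defaults are all assumptions made for illustration and are not taken from this repo.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization. Real Malayalam text would
    # likely need punctuation handling and possibly stemming.
    return text.lower().split()


def document_frequencies(docs):
    # Number of documents in which each term appears at least once.
    return Counter(term for doc in docs for term in set(doc))


def tfidf_scores(query, docs):
    n = len(docs)
    df = document_frequencies(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term]:
                idf = math.log(n / df[term]) + 1.0
                score += (tf[term] / len(doc)) * idf
        scores.append(score)
    return scores


def bm25_scores(query, docs, k1=1.5, b=0.75):
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n
    df = document_frequencies(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term]:
                idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
                denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
                score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def normalize(scores):
    # Min-max scaling to [0, 1] so the two score lists are comparable.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rank(query_text, doc_texts):
    # Returns document indices, best match first.
    if not doc_texts:
        return []
    docs = [tokenize(text) for text in doc_texts]
    query = tokenize(query_text)
    bm25 = normalize(bm25_scores(query, docs))
    tfidf = normalize(tfidf_scores(query, docs))
    combined = [(x + y) / 2 for x, y in zip(bm25, tfidf)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
```

Calling `rank("kerala flood", articles)` on a list of article strings returns the article indices ordered from most to least relevant; the top few could then be treated as candidate ground-truth answers for that query.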

The dataset includes news articles from 2014 to 2018.

## Note

This repo is obsolete, and scraping no longer works on the site mentioned above.

DATASET

Directly download the complete dataset from Dropbox

Email : [email protected]

Related Works

A similar repo with a Telugu dataset can be found here.