corpus-linguistics topic
kanji-frequency
Kanji usage frequency data collected from various sources
corpus-db
A textual corpus database for the digital humanities.
goclassy
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
sanskrit
Data for the quantitative study of (Vedic) Sanskrit
Natural-Language-Processing-with-Python-Analyzing-Text-with-the-Natural-Language-Toolkit
My solutions to selected exercises to "Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper.
PICCL
A set of workflows for corpus building through OCR, post-correction and normalisation
CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
nerus
Large silver-standard Russian corpus with NER, morphology and syntax markup
biomedical_corpora
Table compiling the list of biomedically-related corpora available for named entity recognition (some are also suitable for association detection). The first version was published as part of the paper...