corpus-linguistics topic

List corpus-linguistics repositories

kanji-frequency

121 stars, 19 forks

Kanji usage frequency data collected from various sources

corpus-db

57 stars, 8 forks

A textual corpus database for the digital humanities.

goclassy

85 stars, 6 forks

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

kontext

59 stars, 22 forks

An advanced, extensible web front-end for the Manatee-open corpus search engine

sanskrit

104 stars, 41 forks

Data for the quantitative study of (Vedic) Sanskrit

My solutions to selected exercises from "Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper.

PICCL

46 stars, 6 forks

A set of workflows for corpus building through OCR, post-correction and normalisation

CogNet

42 stars, 9 forks

CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates

nerus

58 stars, 10 forks

Large silver-standard Russian corpus with NER, morphology, and syntax markup

biomedical_corpora

18 stars, 4 forks

A table of biomedical corpora available for named entity recognition (some are also suitable for association detection). The first version was published as part of the paper...