text-cleaning topic

List text-cleaning repositories

trafilatura

3.0k stars, 228 forks

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

HarvestText

2.3k stars, 330 forks

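Text mining and preprocessing toolkit (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods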

clean-text

929 stars, 77 forks

🧹 Python package for text cleaning

textclean

239 stars, 26 forks

Tools for cleaning and normalizing text data

extractnet

182 stars, 20 forks

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

Takin

26 stars, 6 forks

A Python toolkit for file processing, text cleaning, and data splitting.

pnlp

28 stars, 7 forks

NLP pre-processing and post-processing tools.

kor-text-preprocess

16 stars, 1 fork

Korean text data preprocessing toolkit for NLP

grammarify

65 stars, 8 forks

Grammarify is an npm package that safely cleans up text with misspellings, improper capitalization, and lexical illusions, among other things.