text-extraction topic

List text-extraction repositories

trafilatura

3.0k stars · 228 forks

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

jusText

688 stars · 78 forks

Heuristic-based boilerplate removal tool
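To make "heuristic-based boilerplate removal" concrete, here is a minimal standard-library sketch of the general idea: split a page into text blocks, then keep blocks that are long enough, contain few link characters, and have a reasonable share of stopwords. It is not jusText's API or its actual algorithm. The names `ParagraphCollector`, `is_content`, and `extract_main_text`, the tiny stopword list, and all thresholds are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Illustrative values only; a real tool would use per-language stopword lists
# and tuned thresholds.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "was"}
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td"}


class ParagraphCollector(HTMLParser):
    """Collects (text, characters_inside_links) for each block-level element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def _flush(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.paragraphs.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def is_content(text, link_chars, min_length=40,
               max_link_density=0.3, min_stopword_density=0.2):
    """Heuristic: long enough, not mostly links, and reads like prose."""
    if len(text) < min_length:
        return False
    words = text.lower().split()
    link_density = link_chars / len(text)
    stopword_hits = sum(w.strip(".,;:!?") in STOPWORDS for w in words)
    return (link_density <= max_link_density
            and stopword_hits / len(words) >= min_stopword_density)


def extract_main_text(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return [text for text, links in collector.paragraphs
            if is_content(text, links)]
```

On a page with a navigation bar made of links and one prose paragraph, `extract_main_text` would be expected to drop the navigation block and keep the paragraph. Nested block elements and context between neighbouring blocks are ignored here.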

sumy

3.4k stars · 523 forks

Module for automatic summarization of text documents and HTML pages.

cat

90 stars · 18 forks

Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.

tika-python

1.4k stars · 234 forks

Tika-Python is a Python binding to the Apache Tika™ REST services, allowing Tika to be called natively from Python.

image-text-localization-recognition

937 stars · 237 forks

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.

srt

436 stars · 45 forks

A simple library and set of tools for parsing, modifying, and composing SRT files.
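As a rough illustration of what parsing, modifying, and composing SRT files involves, the sketch below uses only the Python standard library. It is not the srt library's API. `Cue`, `parse_srt`, `shift`, and `compose_srt` are hypothetical names, and the parser assumes well-formed input (index line, timing line, then text, with cues separated by blank lines).

```python
import re
from dataclasses import dataclass
from datetime import timedelta
from typing import List

# Matches timestamps such as 00:01:02,345 (a dot is tolerated as separator).
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    index: int
    start: timedelta
    end: timedelta
    text: str


def parse_timestamp(value: str) -> timedelta:
    match = TIMESTAMP.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes,
                     seconds=seconds, milliseconds=millis)


def format_timestamp(value: timedelta) -> str:
    # Integer division by a timedelta avoids floating-point rounding.
    total_ms = value // timedelta(milliseconds=1)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def parse_srt(data: str) -> List[Cue]:
    cues = []
    normalized = data.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        if len(lines) < 2:
            continue
        start, end = lines[1].split("-->")
        cues.append(Cue(int(lines[0]), parse_timestamp(start),
                        parse_timestamp(end), "\n".join(lines[2:])))
    return cues


def shift(cues: List[Cue], offset: timedelta) -> List[Cue]:
    """Return new cues with both timestamps moved by offset."""
    return [Cue(c.index, c.start + offset, c.end + offset, c.text)
            for c in cues]


def compose_srt(cues: List[Cue]) -> str:
    blocks = [
        f"{c.index}\n{format_timestamp(c.start)} --> "
        f"{format_timestamp(c.end)}\n{c.text}\n"
        for c in cues
    ]
    return "\n".join(blocks)
```

A typical use would be `compose_srt(shift(parse_srt(text), timedelta(seconds=2)))` to delay every subtitle by two seconds. Real-world files often contain malformed blocks, which this sketch does not try to recover from.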

unipdf

2.4k stars · 246 forks

Golang PDF library for creating and processing PDF files (pure Go)

breadability

203 stars · 26 forks

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

lambda-text-extractor

172 stars · 40 forks

AWS Lambda functions to extract text from various binary formats.