text-extraction topic
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
jusText
Heuristic-based boilerplate removal tool
sumy
Module for automatic summarization of text documents and HTML pages.
cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services, allowing Tika to be called natively from Python.
image-text-localization-recognition
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.
srt
A simple library and set of tools for parsing, modifying, and composing SRT files.
unipdf
Golang PDF library for creating and processing PDF files (pure Go)
breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)
lambda-text-extractor
AWS Lambda functions to extract text from various binary formats.