text-extraction topic
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
benchmarks
Benchmarking PDF libraries
pd3f-core
📑 Python package to reconstruct the original continuous text from PDFs with language models
wagtail_textract
Text extraction for Wagtail document search
office-text-extractor
Yet another library to extract text from MS Office and PDF files
docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boo...
scummtr
Fan translation tools for LucasArts SCUMM games
pdf-text-data-extractor
PDF text data extraction web app with OCR for scanned documents
video_text_detection
Bachelor Thesis | Text extraction from complex video scenes
tesseract-ocr-wrapper
A highly efficient Python wrapper for tesseract-ocr.