document-parsing topic

List document-parsing repositories

PaddleOCR

65.4k Stars, 9.4k Forks, 65.4k Watchers

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

unstructured

8.6k Stars, 702 Forks

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

edenai-apis

374 Stars, 53 Forks

Eden AI: simplify the use and deployment of AI technologies by providing a single API that connects to the best possible AI engines.

papercast

32 Stars, 1 Fork

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF with GROBID and LangChain, and listen to them as a podcast. Customize your own pipelines...

community

19 Stars, 6 Forks

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

docling

22.8k Stars, 1.3k Forks

Get your documents ready for gen AI

docstrange

1.0k Stars, 98 Forks, 1.0k Watchers

Extract and convert data from any document, image, PDF, Word file, PowerPoint file, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.