content-extraction topic

List content-extraction repositories

boilerpipe-ruby

40 Stars, 5 Forks

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

readability2

107 Stars, 15 Forks

Readability2 converts HTML to plain text.

extractnet

182 Stars, 20 Forks

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

learnhtml

32 Stars, 9 Forks

Web content extraction using machine learning

sumo

19 Stars, 5 Forks

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

nextjs-pdf-parser

37 Stars, 6 Forks

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

pdfix_sdk_example_cpp

16 Stars, 4 Forks

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...