content-extraction topic
boilerpipe-ruby
Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles
readability2
Readability2 converts HTML to plain text.
extractnet
A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
learnhtml
Web content extraction using machine learning
sumo
Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more
nextjs-pdf-parser
Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.
pdfix_sdk_example_cpp
Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
firecrawl-mcp-server
Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.