readable-web-extractor-comparison
Manually compare various readable web extractor libraries against different websites
Readable Web Extractor Comparison
How do various readable website extractor libraries (i.e., libraries that provide a feature like Reader View in Safari) perform?
This repo provides a way to compare many libraries across many pages at once.
Currently the following libraries are implemented:
- mozilla/readability
- cleanview
- metascraper
- @postlight/mercury-parser
- TODO - clean-mark (377 stars)
- TODO - ascrape-js (13 stars)
Results
The latest output from running the comparisons on a set of 16 random pages selected from Hacker News in June 2020 is available on the gh-pages branch (direct link to report).
Based on these comparisons, @awendland intends to use the mozilla/readability project.
Example Report
Usage
Make sure to run yarn to ensure all dependencies are installed. Each command should include --help documentation and produce explanatory output during execution.
Fetching Test Pages
Create a newline-delimited list of URLs to fetch and store it in a text file such as test_urls.txt.
Use the fetch-test-pages script to retrieve the pages and save them into a folder such as test_pages/ for report processing.
yarn scripts:run ./scripts/fetch-test-pages.ts --listOfUrls test_urls.txt --outDir test_pages/ --parallelism 30
They will be saved as JSON files containing information such as the source URL and the HTML contents of the page.
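The fetch step can be sketched roughly as follows: parse the newline-delimited URL list, then process the URLs with a cap on how many requests are in flight, which is what a flag like --parallelism suggests. This is a minimal sketch and not the repo's actual script. The names SavedPage, parseUrlList, mapWithConcurrency, and fakeFetch are hypothetical, and the url/html field names are an assumption based on the description above.

```typescript
// Hypothetical shape of a saved test page; the real field names may differ.
type SavedPage = { url: string; html: string };

// Split a newline-delimited list into URLs, skipping blank lines.
function parseUrlList(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results keep the same order as `items`.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Stand-in fetcher so the sketch runs offline; a real one would issue an HTTP request.
async function fakeFetch(url: string): Promise<SavedPage> {
  return { url, html: `<html><body>${url}</body></html>` };
}
```

With these pieces, something like mapWithConcurrency(parseUrlList(fileContents), 30, fakeFetch) would yield one SavedPage per URL, each of which could then be serialized with JSON.stringify and written to the output folder.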
Generating Comparison Report
Once test pages have been retrieved, a report can be generated. The following command generates a report named report.html from test pages saved in test_pages/.
yarn scripts:run ./scripts/generate-report.ts --testPages 'test_pages/*.json' --reportFile report.html
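To give a feel for what a comparison report involves, here is a minimal sketch that lays out per-library results for each page as an HTML table. It does not reflect the layout of the actual report. ReportRow, escapeHtml, and renderReport are hypothetical names, and comparing only extracted titles is a simplifying assumption.

```typescript
// Hypothetical result row: one page plus the title each library extracted
// (null or missing when a library produced no result).
type ReportRow = { url: string; titles: Record<string, string | null> };

// Escape text so extracted content cannot break the report markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render one table row per page and one column per library.
function renderReport(libraries: string[], rows: ReportRow[]): string {
  const head = ["Page", ...libraries]
    .map((label) => `<th>${escapeHtml(label)}</th>`)
    .join("");
  const body = rows
    .map((row) => {
      const cells = libraries.map((lib) => {
        const title = row.titles[lib] ?? null;
        return `<td>${title === null ? "(no result)" : escapeHtml(title)}</td>`;
      });
      return `<tr><td>${escapeHtml(row.url)}</td>${cells.join("")}</tr>`;
    })
    .join("\n");
  return `<table>\n<tr>${head}</tr>\n${body}\n</table>`;
}
```

Placing the libraries side by side for the same page makes it easy to spot where one extractor fails and another succeeds.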
Contributing
Adding New Libraries for Comparison
Adding a new library to the comparison involves several steps:
- Add the library (and any associated @types/ package) as a project dependency:
  yarn add LIBRARY_NAME --exact
- Author an adapter for the library in scripts/lib/adapters/adapter-LIBRARY_NAME.ts which conforms to the following type (detailed in scripts/lib/types.ts):
  type Adapter = {
    metadata: AdapterMetadata
    extract(params: ExtractParams): Promise<ExtractedInfo | null>
  }
- Register the adapter in scripts/lib/adapters/index.ts
- Generate a report to make sure that it works
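As an illustration of the adapter step, the sketch below defines a toy adapter that conforms to the Adapter type. Only the Adapter signature comes from this README. The fields of AdapterMetadata, ExtractParams, and ExtractedInfo are assumptions made for the example (the real definitions live in scripts/lib/types.ts), and the regex-based extraction is a stand-in for calling a real library.

```typescript
// Assumed shapes for illustration only; see scripts/lib/types.ts for the real ones.
type AdapterMetadata = { name: string };
type ExtractParams = { url: string; html: string };
type ExtractedInfo = { title: string | null; contentHtml: string };

type Adapter = {
  metadata: AdapterMetadata;
  extract(params: ExtractParams): Promise<ExtractedInfo | null>;
};

// Toy adapter: pulls <title> and <body> out with regular expressions.
// A real adapter would hand `html` to the library being compared instead.
const toyAdapter: Adapter = {
  metadata: { name: "toy-example" },
  async extract({ html }) {
    const title = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
    const body = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
    // Returning null signals that nothing readable could be extracted.
    if (!body) return null;
    return {
      title: title ? title[1].trim() : null,
      contentHtml: body[1].trim(),
    };
  },
};

// Hypothetical registry, analogous to what scripts/lib/adapters/index.ts might export.
const adapters: Adapter[] = [toyAdapter];
```

Returning null rather than throwing lets a report record that a library produced no result for a page without aborting the whole run.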