Fully rewrite the HTML parser

Open aecio opened this issue 7 years ago • 0 comments

The current implementation is messy and very hard to maintain or change. The new implementation should be compatible with the current one and add new features:

[x] Should normalize relative links
[x] Should validate links and discard invalid ones
[x] Should extract deep web .onion links
[x] Should extract anchor text
[x] Should extract text around links
[ ] Should extract meta-tags (description, keywords, etc)
[x] Should decode HTML entities to regular characters (turn &amp;amp; into &amp;) from links
[x] Should decode HTML entities to regular characters (turn &amp;amp; into &amp;) from text
[x] Should remove the fragment portion of the URL (anything after the character #)
[x] Should do basic link normalization (lowercase domain, reorder query parameters, etc)
[ ] NEW: Extract links to images and regular links separately
[ ] NEW: Allow for easy extensions such as extraction of meta tags such as og:description, og:title, etc
[ ] etc
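
Several of the link-related items above (resolving relative links, discarding invalid ones, decoding entities, dropping the fragment, lowercasing the domain, reordering query parameters) could be sketched roughly as follows. This is only an illustration in Python using the standard library, not the project's actual parser: the function name `normalize_link`, the rule that only http(s) links with a host count as valid, and the choice to drop user info are all assumptions made for the example. It also assumes `href` is the raw attribute text, not yet entity-decoded.

```python
from html import unescape
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


def normalize_link(base_url, href):
    """Return a normalized absolute URL, or None if the link looks invalid.

    Hypothetical helper for illustration only.
    """
    # Decode HTML entities (e.g. "&amp;" -> "&"), then resolve relative links.
    absolute = urljoin(base_url, unescape(href.strip()))
    parts = urlsplit(absolute)

    # Assumed validity rule: keep only http(s) links that have a host.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None

    # A malformed port makes the link invalid.
    try:
        port = parts.port
    except ValueError:
        return None

    # Lowercase the host; user info (user:password@) is dropped in this sketch.
    host = parts.hostname.lower()
    netloc = host if port is None else f"{host}:{port}"

    # Reorder query parameters so equivalent URLs compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # Passing an empty fragment removes everything after "#".
    return urlunsplit((parts.scheme, netloc, parts.path or "/", query, ""))


print(normalize_link("http://Example.com/a/b.html", "../c?z=1&amp;a=2#top"))
# prints http://example.com/c?a=2&z=1
```

Under these assumptions .onion links need no special handling, since they pass through like any other http(s) host, while links such as `mailto:` are discarded.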

May 23 '17 22:05 aecio
