Fully rewrite the HTML parser

Open aecio opened this issue 7 years ago • 0 comments

The current implementation is messy and very hard to maintain or change. The new implementation should be compatible with the current one and add new features:

  • [x] Should normalize relative links
  • [x] Should validate links and discard invalid ones
  • [x] Should extract deep web .onion links
  • [x] Should extract anchor text
  • [x] Should extract text around links
  • [ ] Should extract meta-tags (description, keywords, etc)
  • [x] Should decode HTML entities to regular characters (turn &amp; into &) from links
  • [x] Should decode HTML entities to regular characters (turn &amp; into &) from text
  • [x] Should remove the fragment portion of the URL (anything after the character #)
  • [x] Should do basic link normalization (lowercase domain, reorder query parameters, etc)
  • [ ] NEW: Extract links to images and regular links separately
  • [ ] NEW: Allow for easy extensions, e.g. extraction of meta tags such as og:description, og:title, etc
  • [ ] etc
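
To make the link-related items above concrete, here is a rough sketch of what the normalization step could look like. It is written in Python with only the standard library, purely for illustration; it is not the project's actual code. The names `normalize_link` and `is_onion` are hypothetical, and the validity rule (keep only http/https links that have a host) is an assumption, since the issue does not define what counts as an invalid link.

```python
from html import unescape
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


def normalize_link(base_url, href):
    """Return a normalized absolute URL, or None if the link looks invalid.

    Hypothetical helper: the name and the validity rule are assumptions.
    """
    # Decode HTML entities (e.g. "&amp;" -> "&"), then resolve relative links.
    absolute = urljoin(base_url, unescape(href.strip()))
    parts = urlsplit(absolute)
    try:
        port = parts.port  # raises ValueError for a malformed port
    except ValueError:
        return None
    # Assumed validity rule: only http(s) links with a host are kept.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    # Lowercase the host; user info is dropped, an explicit port is kept.
    host = parts.hostname.lower()
    if ":" in host:  # IPv6 literal, restore the brackets
        host = f"[{host}]"
    if port is not None:
        host = f"{host}:{port}"
    # Reorder query parameters so equivalent URLs compare equal.
    # Note: values are percent-decoded and re-encoded along the way.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # An empty fragment drops everything after "#".
    return urlunsplit((parts.scheme, host, parts.path or "/", query, ""))


def is_onion(url):
    """True if the URL's host ends in .onion (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".onion")
```

For example, `normalize_link("http://Example.com/a/b.html", "../c?z=1&amp;a=2#top")` resolves the relative path, decodes the entity, sorts the query, and strips the fragment, while a `javascript:` or `mailto:` link returns `None`. Anchor text, surrounding text, and meta-tag extraction would need an HTML parser on top of this and are left out of the sketch.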

aecio avatar May 23 '17 22:05 aecio