Fully rewrite the HTML parser
The current implementation is messy and very hard to maintain or change. The new implementation should be compatible with the current one and add new features:
- [x] Should normalize relative links
- [x] Should validate links and discard invalid ones
- [x] Should extract deep web `.onion` links
- [x] Should extract anchor text
- [x] Should extract text around links
- [ ] Should extract meta-tags (description, keywords, etc)
- [x] Should decode HTML entities to regular characters (turn `&amp;` into `&`) from links
- [x] Should decode HTML entities to regular characters (turn `&amp;` into `&`) from text
- [x] Should remove the fragment portion of the URL (anything after the `#` character)
- [x] Should do basic link normalization (lowercase domain, reorder query parameters, etc)
- [ ] NEW: Extract links to images and regular links separately
- [ ] NEW: Allow for easy extensions, such as extraction of meta tags like `og:description`, `og:title`, etc.
- [ ] etc
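
To make several of the checklist items concrete, here is a minimal sketch in Python using only the standard library. It is an illustration of the requirements, not the project's actual parser: the names `normalize_link` and `LinkExtractor` are hypothetical, and treating only `http`/`https` links with a host as valid is an assumption. It resolves relative links, discards invalid ones, drops the fragment, lowercases the domain, reorders query parameters, and collects anchor text and image links separately.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


def normalize_link(base_url, href):
    """Return a normalized absolute URL, or None if the link is invalid.

    Hypothetical helper. `href` is expected to have its HTML entities
    decoded already (HTMLParser does this for attribute values).
    """
    try:
        # Resolve relative links against the page URL.
        parts = urlsplit(urljoin(base_url, href.strip()))
        host = parts.hostname
    except ValueError:
        return None  # malformed URL, e.g. a broken IPv6 literal
    # Assumption: only http(s) links with a host are considered valid.
    if parts.scheme not in ("http", "https") or not host:
        return None
    # Reorder query parameters so equivalent URLs compare equal.
    # Note: urlencode re-encodes values, so escaping may change form.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase the network location and drop the fragment.
    # (This sketch assumes the URL carries no userinfo.)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))


class LinkExtractor(HTMLParser):
    """Collects (url, anchor text) pairs and image URLs separately."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)  # decode entities in text
        self.base_url = base_url
        self.links = []    # list of (url, anchor text)
        self.images = []   # list of image URLs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []
        elif tag == "img" and attrs.get("src"):
            url = normalize_link(self.base_url, attrs["src"])
            if url:
                self.images.append(url)

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = normalize_link(self.base_url, self._href)
            if url:
                # Collapse runs of whitespace in the anchor text.
                anchor = " ".join("".join(self._text).split())
                self.links.append((url, anchor))
            self._href = None


page = (
    '<a href="../c?z=1&amp;a=2#frag">Tom &amp; Jerry</a>'
    '<a href="mailto:x@example.com">mail</a>'
    '<img src="/logo.png">'
)
extractor = LinkExtractor("http://Example.com/a/b.html")
extractor.feed(page)
extractor.close()
print(extractor.links)   # regular links with anchor text
print(extractor.images)  # image links, kept separately
```

Keeping normalization in a standalone function and extraction in a small handler class is one way to leave room for the extensions listed above: meta tags such as `og:description` could be handled by adding another branch in `handle_starttag` without touching the link logic.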