extraction
A Python library for extracting titles, images, descriptions and canonical urls from HTML.
Although `README.md` suggests that `lxml` is used when available, the code hard-coded `html5lib` as the only parser. I have also added checks to parse...
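A minimal sketch of the kind of parser fallback described above, using only the standard library to check what is installed. The function name `pick_parser` and the preference order are assumptions made for illustration; they are not taken from the library's code.

```python
from importlib.util import find_spec


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the name of the first installed parser in `preferred`.

    `pick_parser` is a hypothetical helper, not part of the library.
    Falls back to "html.parser", which ships with Python.
    """
    for name in preferred:
        # find_spec returns None when a top-level module is not importable
        if find_spec(name) is not None:
            return name
    return "html.parser"


parser_name = pick_parser()
# The name could then be passed explicitly to BeautifulSoup, e.g.
# BeautifulSoup(html, parser_name), so the parser is never left implicit.
```

Choosing the parser by name in one place keeps the documented behaviour ("use `lxml` if available") and the code in agreement.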
Currently only canonical URLs are extracted. It would be fairly easy to add a technique that also extracts outgoing links, and possibly relative links and images as well. Maybe these shouldn't...
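As a rough illustration of what such a technique might look like, here is a sketch built on Python 3's standard-library `html.parser`. The names `LinkCollector` and `collect_links` are hypothetical and do not reflect the library's actual API. Relative links are resolved against an assumed base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href> and <img src> (illustrative only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # urljoin resolves relative links against the page URL
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def collect_links(html, base_url):
    """Split links into same-host ("internal") and other-host ("outgoing")."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    base_host = urlparse(base_url).netloc
    internal = [u for u in collector.links if urlparse(u).netloc == base_host]
    outgoing = [u for u in collector.links if urlparse(u).netloc != base_host]
    return {"internal": internal, "outgoing": outgoing, "images": collector.images}


sample = '<a href="/about">About</a><a href="http://other.org/x">Other</a><img src="logo.png">'
result = collect_links(sample, "http://example.com/page")
```

Whether relative links and images belong in the same result as outgoing links is a design question this sketch leaves open; it simply returns them under separate keys.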
Fixed this warning from BeautifulSoup: `UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but...`
I don't know if this is on the roadmap for this project, but it would be nice! Even without rewriting everything, it might be possible to easily support Python 3...
The Python 2 standard library module `urlparse` was renamed to `urllib.parse` in Python 3. I changed the import statements to be compatible with both Python 2 and Python 3.
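A common way to write such a compatible import is a `try`/`except ImportError` fallback. The sketch below shows the general pattern; the exact names imported in the actual change are not shown in the source, so `urljoin` and `urlparse` are assumed here for illustration.

```python
try:
    # Python 3: the module lives under urllib.parse
    from urllib.parse import urljoin, urlparse
except ImportError:
    # Python 2: the same functions live in the top-level urlparse module
    from urlparse import urljoin, urlparse

# Both names behave the same way on either interpreter.
parts = urlparse("http://example.com/a/b?x=1")
absolute = urljoin("http://example.com/a/b", "../c")
```

Trying the Python 3 location first means the fallback branch only runs on older interpreters.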