
A Python library for extracting titles, images, descriptions and canonical urls from HTML.

Results 6 extraction issues

Although the `README.md` hints that `lxml` will be used if available, the code forces `html5lib` as the only parser. Also, have added checks to parse...
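A minimal sketch of the fallback behaviour the `README.md` seems to describe, assuming Python 3. The helper name `pick_parser` and the preference order are hypothetical and not taken from the library's code; the returned strings are the parser names that BeautifulSoup accepts as its second argument.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first parser in `preferred` that is installed.

    Hypothetical helper: falls back to the standard library's
    "html.parser" when none of the preferred parsers can be imported.
    """
    for name in preferred:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


# The result can be passed explicitly, e.g. BeautifulSoup(markup, pick_parser())
parser_name = pick_parser()
```

Choosing the parser once and passing it explicitly keeps the behaviour the same across machines that have different parsers installed.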

Currently only canonical urls are extracted. It would be fairly easy to add a technique that also extracts outgoing links, and possibly relative links and images as well. Maybe these shouldn't...
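One way the idea above could look, sketched with the Python 3 standard library only. `LinkCollector` and `collect_links` are hypothetical names and are not part of the library; relative links are resolved against an assumed `base_url`, and a link counts as outgoing when its host differs from the base host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href> and <img src> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # urljoin resolves relative links against the page URL
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def collect_links(html, base_url):
    """Split collected links into outgoing and internal by host name."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    base_host = urlparse(base_url).netloc
    outgoing = [u for u in collector.links if urlparse(u).netloc != base_host]
    internal = [u for u in collector.links if urlparse(u).netloc == base_host]
    return {"outgoing": outgoing, "internal": internal, "images": collector.images}
```

For example, `collect_links(page_html, "https://example.com/blog/post")` returns a dictionary with `outgoing`, `internal` and `images` lists of absolute URLs.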

Fixed this warning from BeautifulSoup:

```
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but...
```

I don't know if this is on the roadmap for this project, but it would be nice! Even without rewriting everything, it might be possible to easily support python 3...

The Python 2 standard library module `urlparse` was renamed to `urllib.parse` in Python 3. I changed the import clause so that it is compatible with both Python 2 and Python 3.
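A common way to write such a compatible import is a `try`/`except ImportError` fallback. This is a sketch of the general pattern, not necessarily the exact change made in the pull request; the example URL is illustrative.

```python
try:
    # Python 3: the module lives under urllib
    from urllib.parse import urljoin, urlparse
except ImportError:
    # Python 2: the same functions live in the top-level urlparse module
    from urlparse import urljoin, urlparse

# Both names behave the same way regardless of which branch was taken
parts = urlparse("https://example.com/a/b?x=1")
absolute = urljoin("https://example.com/a/b", "../c")
```

Trying the Python 3 location first means the fallback branch only runs on Python 2 interpreters.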