extruct
Extract embedded metadata from HTML markup
Motivation: https://github.com/scrapinghub/extruct/issues/37#issuecomment-284255022
At present, `extruct` supports an HTTP API for "testing", but that carries a maintenance burden, and it invites feature requests that may nudge it more and more into becoming a monolithic...
Fix this issue:
```
/usr/lib/python3/dist-packages/extruct/rdfa.py:88: SyntaxWarning: invalid escape sequence '\s'
  match = re.search(prefix + ": [^\s]+", head_element.get("prefix"))
```
Hi, Debian bug https://bugs.debian.org/1063330
```
/usr/lib/python3/dist-packages/extruct/rdfa.py:88: SyntaxWarning: invalid escape sequence '\s'
  match = re.search(prefix + ": [^\s]+", head_element.get("prefix"))
```
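The warning above is typically resolved by making the pattern a raw string, so that `\s` reaches the regex engine unchanged instead of being parsed as a string escape. A minimal sketch, assuming a hypothetical helper `find_prefix_uri` that stands in for the flagged line (the actual patch to `rdfa.py` is not shown here):

```python
import re


def find_prefix_uri(prefix, prefix_attr):
    """Hypothetical stand-in for the flagged call in rdfa.py.

    Searches a `prefix` attribute value for '<prefix>: <run of non-space chars>'.
    """
    # The r"..." literal keeps the backslash as-is, so Python no longer
    # warns about the unrecognized string escape '\s'.
    return re.search(prefix + r": [^\s]+", prefix_attr)


# Example attribute value (an assumption, made up for illustration)
match = find_prefix_uri("og", "og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#")
print(match.group(0) if match else None)
```

The compiled pattern is identical in both spellings; only the way the Python parser reads the literal changes.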
# Unicode exception bodge/fix - Added another try/except block to catch a Unicode error that was being thrown when we read LD+JSON structured data from one particular site/page
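The PR's actual change is not reproduced here. As a rough illustration of the general pattern it describes, the sketch below wraps decoding of a JSON-LD payload in a try/except and falls back to a lossy decode. The helper name `load_jsonld` and the fallback strategy are assumptions, not extruct's code:

```python
import json


def load_jsonld(raw):
    """Hypothetical helper: parse a JSON-LD payload supplied as bytes."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: substitute U+FFFD for undecodable bytes
        # instead of failing on the whole page.
        text = raw.decode("utf-8", errors="replace")
    return json.loads(text)


# A payload containing a stray Latin-1 byte (0xE9) that is not valid UTF-8
data = load_jsonld(b'{"name": "caf\xe9"}')
print(data)
```

Whether replacing bad bytes is acceptable depends on the consumer; stricter callers may prefer to log and skip the offending block.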
`extruct` uses `rdflib==4.2.2`, which causes this warning: ``` /.../python3.8/site-packages/rdflib/plugins/parsers/pyRdfa/utils.py:19 /.../python3.8/site-packages/rdflib/plugins/parsers/pyRdfa/utils.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses import os, os.path,...
Given an HTML structure... ``` .... A$140 A$199 You save: A$59 (30% Off) ``` I'm unable to extract these two `meta` tags from the body of the HTML. This is what...
I made a small script in order to try the scraping process. I have a case where, if I use extruct as the CLI, I get lots of information about...
I'd like to bring to your attention that we are [discussing](https://bugs.launchpad.net/lxml/+bug/1958539) the possibility of removing lxml's `clean_html` functionality from the lxml library. Over the past years, there have been several concerning...
Selectolax
- https://github.com/rushter/selectolax#simple-benchmark
- Selectolax w/ Lexbor, Selectolax with Modest, html_parser, lxml, BeautifulSoup
- #209
- LXML has probably had more review than Selectolax, Lexbor, or Modest