extruct

Extract embedded metadata from HTML markup

Results 62 extruct issues

Fixes #53. If we merge this, we should create a separate issue to handle https://github.com/scrapinghub/extruct/issues/53#issuecomment-389053240, which probably requires a custom fallback JSON parser.

~~Currently blocked by https://github.com/scrapinghub/extruct/issues/116~~ Blocked by #146

docs

In the current state of #127, running `tox -e py -- README.rst` and looking at the end of the reported diff between actual output and expectations, the order of the...

bug

Add jsonStringFixer.py, which provides a function that adds quotes around any text in a JSON string that requires them. It is used in jsonld.py to handle invalid JSON-LD strings.
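As a rough illustration of the kind of repair described above, here is a minimal sketch that quotes bare object keys before retrying the parse. It is not the code from jsonStringFixer.py; the names `quote_bare_keys` and `loads_with_fallback` are hypothetical, and the regular expression is a simplifying assumption that can also rewrite text inside string values that happens to look like `, word:`.

```python
import json
import re

# Matches an unquoted key that follows "{" or "," and precedes ":".
# Assumption: keys are simple identifiers, optionally starting with "@".
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_@][A-Za-z0-9_]*)(\s*:)')


def quote_bare_keys(text):
    """Wrap bare object keys in double quotes (hypothetical helper)."""
    return _BARE_KEY.sub(r'\1"\2"\3', text)


def loads_with_fallback(text):
    """Try strict parsing first; repair and retry only on failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Still raises JSONDecodeError if the repair was not enough.
        return json.loads(quote_bare_keys(text))
```

Parsing strictly first means valid JSON-LD is never touched by the repair step, so the heuristic only runs on input that would otherwise be rejected.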

Some web pages contain badly formatted JSON-LD data, such as [this example](https://www.debenhams.com/webapp/wcs/stores/servlet/prod_10701_10001_60742+1515029001_-1). The JSON-LD in this page is: ``` { "@context": "http://schema.org", "@type": "Product", "name": "Black 'Clint' FT0511 cat eye sunglasses",...

enhancement

``extruct.extruct`` always calls ``parse_xmldom_html`` - it would be convenient if results of this parse call could be passed instead, in case we want to do some custom extractions outside.

Normally Open Graph ``<meta>`` tags are in the head, but having them in the body is also surprisingly common - in our internal article dataset they are present in body on...
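To make the idea concrete, the following sketch collects `og:` properties from anywhere in the document instead of only from the head. It uses the standard library parser rather than extruct's own extractor, and `OpenGraphCollector` and `collect_opengraph` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class OpenGraphCollector(HTMLParser):
    """Collect (property, content) pairs from og: meta tags, wherever they appear."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        # No check on the enclosing element: head and body are treated alike.
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content") is not None:
            self.properties.append((prop, attr["content"]))


def collect_opengraph(html):
    parser = OpenGraphCollector()
    parser.feed(html)
    parser.close()
    return parser.properties
```

Because the collector ignores where a tag sits in the tree, pages that place Open Graph tags in the body yield the same kind of result as pages that follow the usual convention.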

Sometimes the JSON-LD schema is preferred as a raw JSON string rather than a Python dict - this PR implements an `as_json` boolean kwarg that determines whether to return a Python dictionary or a JSON string....
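A minimal sketch of what such a switch could look like is shown below. It does not reproduce the PR's implementation: `extract_jsonld` and the `_JsonLdScripts` helper are hypothetical, built on the standard library only, and the return shape (a list of items, or that list serialized to a string) is an assumption.

```python
import json
from html.parser import HTMLParser


class _JsonLdScripts(HTMLParser):
    """Gather the raw text of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_jsonld(html, as_json=False):
    """Return parsed JSON-LD items, or a JSON string when as_json is True."""
    parser = _JsonLdScripts()
    parser.feed(html)
    parser.close()
    items = [json.loads(block) for block in parser.blocks]
    return json.dumps(items) if as_json else items
```

Keeping the flag at the outermost function means the extraction logic stays identical in both modes and only the final serialization differs.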

Hi, I wanted to ask whether anyone out there has used extruct on AWS Lambda. I tested running the `extruct` function, which seems to fail for RDFa. Other default metadata...

docs