extruct
Extract embedded metadata from HTML markup
This PR changes the behaviour so that, when the page source contains both valid and invalid JSON-LD elements, only the valid JSON-LD elements are returned.
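The "keep the valid blocks, skip the invalid ones" behaviour described above can be sketched roughly as follows. This is a minimal standard-library illustration, not extruct's actual implementation: the names `JsonLdScriptCollector`, `extract_valid_jsonld` and `SAMPLE_HTML` are hypothetical, and the sketch assumes each JSON-LD block is a `<script type="application/ld+json">` element that is considered invalid when `json.loads` rejects it.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only for JSON-LD script elements.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_valid_jsonld(html):
    """Return the parsed JSON-LD blocks, silently skipping any that fail to parse."""
    collector = JsonLdScriptCollector()
    collector.feed(html)
    collector.close()
    items = []
    for raw in collector.blocks:
        try:
            items.append(json.loads(raw))
        except json.JSONDecodeError:
            # An invalid block no longer discards the valid ones.
            continue
    return items


# Hypothetical page with one valid and one truncated JSON-LD block.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>
<script type="application/ld+json">{"@type": "Product", "name": </script>
</head></html>
"""

valid_items = extract_valid_jsonld(SAMPLE_HTML)
```

With the sample page, only the first block survives; the truncated second block is dropped rather than causing the whole extraction to fail.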
`data = extruct.extract(r.text, base_url=base_url)` /Users/divyanshu/flask/lib/python3.6/site-packages/rdflib_jsonld/__init__.py:12: DeprecationWarning: The rdflib-jsonld package has been integrated into rdflib as of rdflib==6.0.1. Please remove rdflib-jsonld from your project's dependencies. DeprecationWarning, Traceback (most recent call last):...
The example in https://github.com/scrapinghub/extruct#all-in-one-extraction, which requests ``'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'``, no longer works: it returns no data. Perhaps we should self-host an example.
When scraping the URL: http://www.reakes.com/ Error: ``` Traceback (most recent call last): File ".../website_scraping.py", line 161, in getWebsiteScrapedDataForURL data = extruct.extract(r.text, base_url=base_url) File "/usr/local/lib/python3.8/site-packages/extruct/_extruct.py", line 58, in extract tree =...
Hello, using the CLI version I get the following error message when requesting the json-ld syntax: ```cmd extruct "https://www.drogaraia.com.br/nivea-desodorante-aerosol-deep-original-150ml.html" --syntaxes json-ld Failed to extract json-ld, raises Expecting value: line 1 column...
Consider this HTML construct: ``` ``` This is turned into the following by the opengraph module: ``` [ "og:title", "‘A Path That Is Not Sustainable’: College Hits Breaking Point, Sends...
I'm not sure if this is an issue with extruct, or if there's anything I can do with extruct (or otherwise) to get around this issue, but I've been running...
I noticed that the matching order of `_extract_property_value` seems to be inconsistent with https://www.w3.org/TR/microdata/#values. That document says that the 2nd matching case is "If the element has a...
Fixes #143
`RDFaExtractor` already supports expanded mode via the `expanded` parameter, but there currently seems to be no way to pass it when using "All-in-one" extraction via `extruct.extract`. It'd be nice to add...