extruct

Extract embedded metadata from HTML markup

Results 61 extruct issues

For HTML strings from one site, I get really slow extraction times (~60 seconds). I just call `extruct.extract` with this string: https://pastebin.com/QJbUdaA6 Other strings work in times like...

When trying to extract this: {"@context":"https:\/\/schema.org\/","@graph":[{"@context":"https:\/\/schema.org\/","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":1,"item":{"name":"\u05e2\u05de\u05d5\u05d3 \u05d4\u05d1\u05d9\u05ea","@id":"https:\/\/www.sollan.co.il"}},{"@type":"ListItem","position":2,"item":{"name":"\u05de\u05e6\u05d1\u05e8\u05d9\u05dd \u05e4\u05e8\u05d9\u05e7\u05d4 \u05e2\u05de\u05d5\u05e7\u05d4","@id":"https:\/\/www.sollan.co.il\/product-category\/%d7%9e%d7%a6%d7%91%d7%a8%d7%99%d7%9d-%d7%a4%d7%a8%d7%99%d7%a7%d7%94-%d7%a2%d7%9e%d7%95%d7%a7%d7%94\/"}},{"@type":"ListItem","position":3,"item":{"name":"\u05de\u05e6\u05d1\u05e8 \u05e8\u05db\u05d1 \u05e1\u05d8\u05d0\u05e8\u05d8 \u05e1\u05d8\u05d5\u05e4","@id":"https:\/\/www.sollan.co.il\/product-category\/%d7%9e%d7%a6%d7%91%d7%a8%d7%99%d7%9d-%d7%a4%d7%a8%d7%99%d7%a7%d7%94-%d7%a2%d7%9e%d7%95%d7%a7%d7%94\/%d7%9e%d7%a6%d7%91%d7%a8-%d7%a8%d7%9b%d7%91-%d7%a1%d7%98%d7%90%d7%a8%d7%98-%d7%a1%d7%98%d7%95%d7%a4\/"}},{"@type":"ListItem","position":4,"item":{"name":"\u05de\u05e6\u05d1\u05e8 80 \u05d0\u05de\u05e4\u05e8 \u05d5\u05d5\u05e8\u05d8\u05d4 AGM \u05e1\u05d0\u05e8\u05d8 \u05e1\u05d8\u05d5\u05e4 80Ah stop & start \u05d5\u05e8\u05d8\u05d4","@id":"https:\/\/www.sollan.co.il\/product\/%d7%9e%d7%a6%d7%91%d7%a8-80-%d7%90%d7%9e%d7%a4%d7%a8-%d7%95%d7%95%d7%a8%d7%98%d7%94-agm-%d7%a1%d7%90%d7%a8%d7%98-%d7%a1%d7%98%d7%95%d7%a4-80ah-stop-start-%d7%95%d7%a8%d7%98%d7%94\/"}}]},{"@context":"https:\/\/schema.org\/","@type":"Product","@id":"https:\/\/www.sollan.co.il\/product\/%d7%9e%d7%a6%d7%91%d7%a8-80-%d7%90%d7%9e%d7%a4%d7%a8-%d7%95%d7%95%d7%a8%d7%98%d7%94-agm-%d7%a1%d7%90%d7%a8%d7%98-%d7%a1%d7%98%d7%95%d7%a4-80ah-stop-start-%d7%95%d7%a8%d7%98%d7%94\/#product","name":"\u05de\u05e6\u05d1\u05e8 80 \u05d0\u05de\u05e4\u05e8 \u05d5\u05d5\u05e8\u05d8\u05d4 AGM \u05e1\u05d0\u05e8\u05d8 \u05e1\u05d8\u05d5\u05e4 80Ah...

As @wjdp suggested in issue #171, an apostrophe in the channel's name causes the JSONDecodeError. The `json.loads()` function fails when there are hex codes like "\\x27" in...
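The failure mode described here can be reproduced with the standard library alone. The following is a minimal sketch, not extruct's actual fix: the sample string and the helper name `loads_tolerant` are made up for illustration, and rewriting `\xNN` as `\u00NN` is only one possible workaround.

```python
import json
import re

# Hypothetical JSON-LD fragment. "\x27" is a JavaScript-style hex escape
# for an apostrophe; the JSON grammar only allows \uXXXX, so strict
# parsing rejects it.
raw = r'{"name": "Bob\x27s Channel"}'


def loads_tolerant(text):
    """Parse strictly first; on failure, rewrite \\xNN as \\u00NN and retry.

    Caveat: the regex is naive and would also rewrite an escaped backslash
    followed by "x27", so it is only applied after strict parsing fails.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        fixed = re.sub(r'\\x([0-9a-fA-F]{2})', r'\\u00\1', text)
        return json.loads(fixed)


try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False  # strict parsing rejects the \x27 escape

data = loads_tolerant(raw)
print(strict_ok, data)
```

Trying the strict parse first keeps valid documents untouched; the rewrite only runs on input that would otherwise be dropped.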

I have added the Twitter card functionality, so it now extracts the namespaces and properties of Twitter cards. I have also added 3 test cases. This was a needed feature...

#192: Added support for meta tags outside of the HTML head by changing the function `extract_items()` in class openClassExtractor. Furthermore, added a test case to...

On some pages, meta tags are included outside of the head tag, for example on the YouTube channel page: https://www.youtube.com/c/Freecodecamp As the opengraph extractor only looks in the head tag,...
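To illustrate the difference between searching only the head and searching the whole document, here is a small sketch that uses only the standard library. It is not extruct's implementation: the class name `OpenGraphCollector` and the sample markup are hypothetical, and it simply collects every `og:` meta tag wherever it appears.

```python
from html.parser import HTMLParser


class OpenGraphCollector(HTMLParser):
    """Collect (property, content) pairs from every og: meta tag in the document."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        # No check on the enclosing element: tags in <body> count too.
        if prop.startswith("og:"):
            self.properties.append((prop, attr.get("content")))


# Hypothetical page: one tag in <head>, one misplaced in <body>.
html = """
<html>
  <head><meta property="og:title" content="Example"></head>
  <body><meta property="og:image" content="https://example.com/a.png"></body>
</html>
"""

collector = OpenGraphCollector()
collector.feed(html)
print(collector.properties)
```

An extractor restricted to the head would report only `og:title` for this page; dropping that restriction also picks up the misplaced `og:image` tag.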

Would there be any interest in adding twitter card tags (detailed [here](https://developer.twitter.com/en/docs/twitter-for-websites/cards/overview/markup))? I'd be willing to work on this if there's any interest and submit a pull request.

I have some code to pull metadata from YouTube:

```python
import extruct
import requests

response = requests.get(video_url)
metadata = extruct.extract(response.text, base_url="https://youtube.com")
```

Have noticed some recent crashing, but only on some videos. No crash:...

Test Url: https://www.fabucci.ie/ladies-shoes/marian-gold-stiletto-with-black-toe-cap.html Schema.org Structured Data Testing Tool for same: https://validator.schema.org/#url=https%3A%2F%2Fwww.fabucci.ie%2Fladies-shoes%2Fmarian-gold-stiletto-with-black-toe-cap.html On this page there is 1 product with embedded microdata structured data. The product has 3 images and the...

Hi all. If there is ld+json outside the html element (html.head.body.html.ld+json), then the parser returns an empty list. Firefox and the W3C validator say: Stray start tag "script". So it is clear that site...
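As a rough illustration of how a stray JSON-LD block could still be recovered, the sketch below scans the raw markup with the standard library's lenient `html.parser` instead of relying on a well-formed tree. This is an assumption-laden workaround, not how extruct behaves: the class name `JsonLdCollector` and the sample markup are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect parsed JSON-LD from every matching <script>, wherever it sits."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._chunks)))


# Hypothetical malformed page: the script comes after the closing </html>.
html = """<html><head></head><body></body></html>
<script type="application/ld+json">{"@type": "Product", "name": "Example"}</script>"""

collector = JsonLdCollector()
collector.feed(html)
print(collector.items)
```

Because this parser is event-based and does not build a tree, it has no notion of a "stray" tag, so the block after `</html>` is picked up like any other.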