[WIP] Extract opengraph from body as well
Normally Open Graph <meta property=".." content=".."> tags are in the head, but finding them in the body is also surprisingly common: in our internal article dataset, 5% of the pages that have such tags anywhere on the page have them in the body, and the figure is 12% for products.
One such example is https://www.reuters.com/article/us-health-coronavirus-apple/coronavirus-case-at-apples-irish-hq-trinity-college-goes-online-idUSKBN20X1QT, so this happens even on a popular website.
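To illustrate the difference between head-only and whole-document extraction, here is a minimal sketch using only the standard library's `html.parser`. It is not the code in `extruct/opengraph.py`; the `OpenGraphCollector` class, the `collect_opengraph` helper, and the sample markup are hypothetical names and data made up for this example.

```python
from html.parser import HTMLParser


class OpenGraphCollector(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from the whole page."""

    def __init__(self):
        super().__init__()
        self.section = None   # "head" or "body", whichever was opened last
        self.properties = []  # (section, property, content) triples

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag
        elif tag == "meta":
            attrs = dict(attrs)
            prop = attrs.get("property") or ""
            content = attrs.get("content")
            if prop.startswith("og:") and content is not None:
                self.properties.append((self.section, prop, content))


def collect_opengraph(html):
    """Return every og:* meta tag found, tagged with the section it was in."""
    parser = OpenGraphCollector()
    parser.feed(html)
    parser.close()
    return parser.properties


# Made-up page with one tag in the head and one in the body.
SAMPLE = """<html>
<head><meta property="og:title" content="Head title"></head>
<body><meta property="og:type" content="article"><p>Text</p></body>
</html>"""

found = collect_opengraph(SAMPLE)
head_only = [(prop, content) for section, prop, content in found if section == "head"]
anywhere = [(prop, content) for section, prop, content in found]
```

With this sample, `head_only` contains just the `og:title` entry, while `anywhere` also picks up the `og:type` tag placed in the body, which is the case this PR is meant to cover.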
TODO:
- [ ] add tests
- [ ] double-check what is happening with namespaces
Codecov Report
Merging #129 into master will not change coverage. The diff coverage is 100.00%.
Coverage Diff

| | master | #129 | +/- |
|---|---|---|---|
| Coverage | 87.78% | 87.78% | |
| Files | 11 | 11 | |
| Lines | 475 | 475 | |
| Branches | 103 | 103 | |
| Hits | 417 | 417 | |
| Misses | 52 | 52 | |
| Partials | 6 | 6 | |

| Impacted Files | Coverage Δ | |
|---|---|---|
| extruct/opengraph.py | 100.00% <100.00%> (ø) |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a365dc0...1187d9d.