[WIP] Extract opengraph from body as well

Open • lopuhin opened this issue 5 years ago • 1 comment

Normally Open Graph <meta property=".." content=".."> tags are in the head, but having them in the body is also surprisingly common: in our internal article dataset, 5% of the pages that have such tags anywhere on the page have them in the body, and for products the figure is 12%.

One such example is https://www.reuters.com/article/us-health-coronavirus-apple/coronavirus-case-at-apples-irish-hq-trinity-college-goes-online-idUSKBN20X1QT - so it's even on a popular website.
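To illustrate the idea, here is a minimal sketch that collects og: meta tags from anywhere in the document and records whether each one was found in the head or the body. It is not the extruct implementation: it uses only the standard library's `html.parser`, and the names `OpenGraphCollector` and `extract_opengraph` are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser


class OpenGraphCollector(HTMLParser):
    """Collect (property, content, section) for og: meta tags anywhere on the page.

    Hypothetical helper for illustration; not part of extruct.
    """

    def __init__(self):
        super().__init__()
        self.section = None  # "head" or "body"; None before either tag is seen
        self.properties = []

    def handle_starttag(self, tag, attrs):
        # Remember which top-level section we are currently inside.
        if tag in ("head", "body"):
            self.section = tag
            return
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        # Keep only Open Graph properties that carry a content attribute.
        if prop.startswith("og:") and attrs.get("content") is not None:
            self.properties.append((prop, attrs["content"], self.section))


def extract_opengraph(html):
    """Return og: properties in document order, regardless of head/body placement."""
    parser = OpenGraphCollector()
    parser.feed(html)
    parser.close()
    return parser.properties


html = """
<html>
  <head><meta property="og:title" content="Head title"></head>
  <body>
    <meta property="og:type" content="article">
    <p>text</p>
  </body>
</html>
"""
found = extract_opengraph(html)
```

An extractor that only looks under the head element would miss the `og:type` entry in this sample, which is the case this change is meant to cover.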

TODO:

  • [ ] add tests
  • [ ] double-check what is happening with namespaces

lopuhin, Apr 08 '20 08:04

Codecov Report

Merging #129 into master will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #129   +/-   ##
=======================================
  Coverage   87.78%   87.78%           
=======================================
  Files          11       11           
  Lines         475      475           
  Branches      103      103           
=======================================
  Hits          417      417           
  Misses         52       52           
  Partials        6        6           

| Impacted Files | Coverage Δ |
|---|---|
| extruct/opengraph.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a365dc0...1187d9d.

codecov[bot], Apr 08 '20 08:04