extruct
Some websites put meta tags outside the head.
On some pages, meta tags are placed outside the head tag, for example on this YouTube channel page: https://www.youtube.com/c/Freecodecamp
Since the opengraph extractor only looks in the head tag, all the og:* meta properties are missed. In my fork, I changed the extractor to look in the body instead.
With your permission, I can open a PR.
Here is a link to where I made the change: https://github.com/scrapinghub/extruct/blob/c2cffbed26ae4ab8dd35d1860bfda00c3bac5783/extruct/opengraph.py#L28
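To illustrate the problem without depending on extruct itself, here is a minimal sketch using only the standard library's `html.parser`. The `OGMetaCollector` class, the `collect_og` helper, and the `SAMPLE` markup are all hypothetical and made up for this example; they are not extruct's code and not the actual YouTube page source. The sketch only shows that a head-only search finds nothing when the og:* tags sit in the body.

```python
from html.parser import HTMLParser


class OGMetaCollector(HTMLParser):
    """Collect og:* meta tags and record whether each one was inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # list of (property, content, was_in_head)

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta":
            attr = dict(attrs)
            prop = attr.get("property") or ""
            content = attr.get("content")
            # Keep only OpenGraph properties that actually carry a value.
            if prop.startswith("og:") and content is not None:
                self.found.append((prop, content, self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def collect_og(html):
    """Hypothetical helper: return every og:* meta tag found in the document."""
    parser = OGMetaCollector()
    parser.feed(html)
    parser.close()
    return parser.found


# Made-up markup that mimics a page with og:* tags in the body.
SAMPLE = """<html><head><title>Channel</title></head>
<body><meta property="og:title" content="freeCodeCamp.org">
<meta property="og:type" content="profile"></body></html>"""

found = collect_og(SAMPLE)
head_only = [(p, c) for p, c, in_head in found if in_head]
anywhere = [(p, c) for p, c, _ in found]

print("head only:", head_only)
print("anywhere: ", anywhere)
```

With this sample, the head-only list is empty while the document-wide list contains both properties, which is the behaviour described above.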
Hi @paul-rchds, yes, that would be great. I noticed the same issue myself but didn't get to implement everything required. Here is a link to a PR: https://github.com/scrapinghub/extruct/pull/129/ - feel free to start a new one.
I have changed the extract_item function in the OpengraphExtractor class to include meta tags outside of the head. I tested it on the link shared by @paul-rchds. Please review my PR to check that it works as expected. Thanks