Some websites put meta tags outside the head.

Open paul-rchds opened this issue 3 years ago • 2 comments

On some pages meta tags are included outside of the head tag. For example on the YouTube channel page: https://www.youtube.com/c/Freecodecamp

Because the opengraph extractor only looks inside the head tag, it misses all the og:* meta properties on such pages. In my fork, I changed the extractor to look in the body instead.
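To illustrate the difference, here is a minimal sketch using only Python's standard-library `html.parser`. It is not extruct's actual code; the `OgMetaCollector` class, the `og_properties` helper, and the sample markup are all hypothetical and only mimic a page that places its og:* tags in the body.

```python
from html.parser import HTMLParser


class OgMetaCollector(HTMLParser):
    """Hypothetical helper: collect og:* <meta> tags and note whether each is inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # list of (property, content, inside_head)

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta":
            attr = dict(attrs)
            prop = attr.get("property") or ""
            if prop.startswith("og:"):
                self.found.append((prop, attr.get("content"), self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


# Made-up markup: the og:* tags sit in <body>, not in <head>.
SAMPLE = """
<html>
<head><title>Channel</title></head>
<body>
<meta property="og:title" content="Example channel">
<meta property="og:type" content="profile">
</body>
</html>
"""


def og_properties(html, head_only):
    """Return (property, content) pairs, optionally restricted to tags inside <head>."""
    collector = OgMetaCollector()
    collector.feed(html)
    collector.close()
    return [(p, c) for p, c, inside in collector.found if inside or not head_only]


head_only_result = og_properties(SAMPLE, head_only=True)
whole_document_result = og_properties(SAMPLE, head_only=False)
print(head_only_result)
print(whole_document_result)
```

With this sample, the head-only search finds nothing, while the whole-document search picks up both og:* properties.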

If I get permission, I can open a PR.

Here is a link to where I made the change: https://github.com/scrapinghub/extruct/blob/c2cffbed26ae4ab8dd35d1860bfda00c3bac5783/extruct/opengraph.py#L28

paul-rchds, Apr 13 '22 09:04

Hi @paul-rchds, yes, that would be great. I noticed the same issue myself but didn't get to implement everything required. Here is a link to a PR: https://github.com/scrapinghub/extruct/pull/129/ - feel free to start a new one.

lopuhin, Apr 14 '22 07:04

I have changed the extract_item function in the OpengraphExtractor class to include meta tags outside of the head, and tested it on the link shared by @paul-rchds. Please review my PR to check that it works. Thanks.

frostrot, May 13 '22 18:05