extruct icon indicating copy to clipboard operation
extruct copied to clipboard

Ignore empty OpenGraph props

Open croqaz opened this issue 6 years ago • 15 comments

Ignore empty OpenGraph properties or content.

Fixes #117

Upgraded six>=1.11 to make Python 3.4 tests pass.

This was also done in https://github.com/scrapinghub/extruct/pull/119

croqaz avatar Jul 16 '19 13:07 croqaz

Codecov Report

Merging #120 into master will increase coverage by 0.05%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
+ Coverage   87.63%   87.68%   +0.05%     
==========================================
  Files          11       11              
  Lines         469      471       +2     
  Branches      101      102       +1     
==========================================
+ Hits          411      413       +2     
  Misses         52       52              
  Partials        6        6
Impacted Files Coverage Δ
extruct/_extruct.py 73.33% <ø> (ø) :arrow_up:
extruct/opengraph.py 100% <100%> (ø) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 50a0915...d4645ef. Read the comment docs.

codecov[bot] avatar Jul 16 '19 15:07 codecov[bot]

I believe I adressed all the comments.

The Travis build seems to constantly fail with Python 3.4 at "pip install -U tox codecov".

croqaz avatar Jul 17 '19 09:07 croqaz

@croqaz I have noticed that the same is happening to master, see: https://travis-ci.org/scrapinghub/extruct/builds/558857852?utm_source=github_status&utm_medium=notification

Digging deeper, the failure in the logs is:

FileNotFoundError: [Errno 2] No such file or directory: '/home/travis/virtualenv/python3.4.6/lib/python3.4/site-packages/six-1.10.0.dist-info/METADATA'

I googled and this could be relevant: https://github.com/tox-dev/tox-travis/issues/76. Could you have a look?

ivanprado avatar Jul 17 '19 10:07 ivanprado

Squashed all commits to have a clean history on merge.

croqaz avatar Jul 17 '19 10:07 croqaz

In https://github.com/scrapinghub/extruct/issues/117 idea was to only drop empty values when deciding which value to choose, to be on a safe side. But maybe we can drop them always, as in this PR. @croqaz could you plese check that properties with empty values are always ignored, e.g. using https://developers.facebook.com/tools/debug/sharing/?

kmike avatar Jul 19 '19 13:07 kmike

@kmike Just to make sure I understand this correctly, because I think I might have misunderstood how to solve the problem.

This is the problem:

<meta property="og:description" content="" />
<blah-blah>... </blah-blah> ...
<meta property="og:description" content="My cool website" />

Duplicate OpenGraph with the same name, but one of them has an empty value. In many cases, the empty value is the first, but it might be second just as well.

Solution:

  1. Do we want to completely drop OpenGraph meta tags with empty value?
  2. Do we want to keep the tags with empty value, but move them at the end of the values list, so they have a lower priority?

In this PR I implemented the first option.

croqaz avatar Jul 19 '19 14:07 croqaz

Oh, Facebook's Debug tool only shows one of the OpenGraph meta tags. It seems to be the most relevant. But I don't think they care much about the order - in case of images, I'm not 100% they always take the first one - but I might be wrong, because I don't know how to find a website that has multiple images in a "random" order and check what Facebook would extract...

croqaz avatar Jul 19 '19 14:07 croqaz

The question is: if there is only a single property with an empty value, is it returned or dropped?

kmike avatar Jul 19 '19 14:07 kmike

Ok, I understand now. I wonder if we can quickly craft a page like that and feed that to Facebook and see how it works? Because I don't know how to find a page like that "out there".

Actually I might craft a page like that relatively easy. Let me try.

croqaz avatar Jul 19 '19 14:07 croqaz

For:

<meta property="og:description" content="" />
<meta property="og:image" content="" />
<meta property="og:title" content="" />

Facebook Sharing Debugger shows:

  • Provided og:image URL, was not a valid URL
  • og:title - Taken from the tag
  • og:description - empty

croqaz avatar Jul 19 '19 14:07 croqaz

I also tested with:

<meta property="og:image" content="" />
<meta property="og:image" content="/images/logo.png" />

And Facebook Sharing Debugger shows: "Provided og:image URL, was not a valid URL" But also the logo is corretly identified and displayed as a main image.

So if the image is first and the empty is second, OR if the image is second and the empty is first, the Sharing Debugger still finds the correct image.

croqaz avatar Jul 19 '19 14:07 croqaz

Hey guys! Is there something which prevents us from merging that fix? Would be great to have it merged because it blocks a bit a PR in another project.

jakubwasikowski avatar Jul 25 '19 12:07 jakubwasikowski

My cent: Looking at http://ogp.me/ and https://developers.facebook.com/docs/sharing/webmasters/#markup I don't see any definition of a property that could contain an empty value and still have some meaning. It seems that properties cannot be used to label a page somehow. So I would include the removal of every single property without value from the output. What do you think the rest of reviewers?

ivanprado avatar Jul 25 '19 12:07 ivanprado

@ivanprado You mean in this case of: property="og:title" content=" " or property="og:description" content="" both properties should be dropped from the result?

croqaz avatar Jul 25 '19 13:07 croqaz

@croqaz well, I have some doubts regarding the fix. By one side, I would say that it should be the user using extruct the one with the responsibility to filter empty values (option 1). extruct should only expose the data in the proper order (appearance order), empty or not.

But by the other side, I have not seen anything in the protocol definition pointing that is possible to create properties with empty values, so removing all of them could be reasonable (option 2). But the question is, it is the same a document with property="og:description" content="" than a document without any annotation? I'm not sure 100%.

The solution of cleaning empty values but only if there is at least one non-empty value for the duplicated property seems like a strange mix (option 3). But maybe could do the job:

  • User don't have to deal with empty values by himself, as they are removed
  • Meaning could be retained in case of empty content

But is weird as the user still will have to deal with empty value for the particular case. It is a little bit strange.

So in conclusión, I have not a clear view about what we should do, the three options make some sense. The prefered one by my side would be the option 1, but I'm also ok with option 2 and 3. @croqaz @kmike @Gallaecio, what do you think?

ivanprado avatar Jul 25 '19 13:07 ivanprado

I have the same concerns as Ivan. For OpenGraph we return a single value for an attribute, and empty values affect prioritization in case of multiple meta tags, so doing a correct prioritization (non-empty over empty) is fixing a bug.

Removing all empty values is a different thing; it solves the above mention prioritization problem, but it is a larger change, with unclear outcome - it can be a right thing to do (if having an empty value is the same as not having tag at all), or it can be a data loss (if a presence of a tag with empty value is meaningful information). To decide we need to answer this question, and I'm not sure how to do it :) My suggestion was a tool by Facebook (as standard says nothing), and it seems to confirm that empty value is meaningful - if I'm reading @croqaz's results properly, it does show an empty description if a description is present, but empty - and shows nothing if description is absent.

kmike avatar Jul 25 '19 14:07 kmike