extruct
extruct copied to clipboard
Ignore empty OpenGraph props
Ignore empty OpenGraph properties or content.
Fixes #117
Upgraded six>=1.11 to make Python 3.4 tests pass.
This was also done in https://github.com/scrapinghub/extruct/pull/119
Codecov Report
Merging #120 into master will increase coverage by
0.05%. The diff coverage is100%.
@@ Coverage Diff @@
## master #120 +/- ##
==========================================
+ Coverage 87.63% 87.68% +0.05%
==========================================
Files 11 11
Lines 469 471 +2
Branches 101 102 +1
==========================================
+ Hits 411 413 +2
Misses 52 52
Partials 6 6
| Impacted Files | Coverage Δ | |
|---|---|---|
| extruct/_extruct.py | 73.33% <ø> (ø) |
:arrow_up: |
| extruct/opengraph.py | 100% <100%> (ø) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 50a0915...d4645ef. Read the comment docs.
I believe I adressed all the comments.
The Travis build seems to constantly fail with Python 3.4 at "pip install -U tox codecov".
@croqaz I have noticed that the same is happening to master, see: https://travis-ci.org/scrapinghub/extruct/builds/558857852?utm_source=github_status&utm_medium=notification
Digging deeper, the failure in the logs is:
FileNotFoundError: [Errno 2] No such file or directory: '/home/travis/virtualenv/python3.4.6/lib/python3.4/site-packages/six-1.10.0.dist-info/METADATA'
I googled and this could be relevant: https://github.com/tox-dev/tox-travis/issues/76. Could you have a look?
Squashed all commits to have a clean history on merge.
In https://github.com/scrapinghub/extruct/issues/117 idea was to only drop empty values when deciding which value to choose, to be on a safe side. But maybe we can drop them always, as in this PR. @croqaz could you plese check that properties with empty values are always ignored, e.g. using https://developers.facebook.com/tools/debug/sharing/?
@kmike Just to make sure I understand this correctly, because I think I might have misunderstood how to solve the problem.
This is the problem:
<meta property="og:description" content="" />
<blah-blah>... </blah-blah> ...
<meta property="og:description" content="My cool website" />
Duplicate OpenGraph with the same name, but one of them has an empty value. In many cases, the empty value is the first, but it might be second just as well.
Solution:
- Do we want to completely drop OpenGraph meta tags with empty value?
- Do we want to keep the tags with empty value, but move them at the end of the values list, so they have a lower priority?
In this PR I implemented the first option.
Oh, Facebook's Debug tool only shows one of the OpenGraph meta tags. It seems to be the most relevant. But I don't think they care much about the order - in case of images, I'm not 100% they always take the first one - but I might be wrong, because I don't know how to find a website that has multiple images in a "random" order and check what Facebook would extract...
The question is: if there is only a single property with an empty value, is it returned or dropped?
Ok, I understand now. I wonder if we can quickly craft a page like that and feed that to Facebook and see how it works? Because I don't know how to find a page like that "out there".
Actually I might craft a page like that relatively easy. Let me try.
For:
<meta property="og:description" content="" />
<meta property="og:image" content="" />
<meta property="og:title" content="" />
Facebook Sharing Debugger shows:
- Provided og:image URL, was not a valid URL
- og:title - Taken from the
tag - og:description - empty
I also tested with:
<meta property="og:image" content="" />
<meta property="og:image" content="/images/logo.png" />
And Facebook Sharing Debugger shows: "Provided og:image URL, was not a valid URL" But also the logo is corretly identified and displayed as a main image.
So if the image is first and the empty is second, OR if the image is second and the empty is first, the Sharing Debugger still finds the correct image.
Hey guys! Is there something which prevents us from merging that fix? Would be great to have it merged because it blocks a bit a PR in another project.
My cent: Looking at http://ogp.me/ and https://developers.facebook.com/docs/sharing/webmasters/#markup I don't see any definition of a property that could contain an empty value and still have some meaning. It seems that properties cannot be used to label a page somehow. So I would include the removal of every single property without value from the output. What do you think the rest of reviewers?
@ivanprado You mean in this case of: property="og:title" content=" " or property="og:description" content="" both properties should be dropped from the result?
@croqaz well, I have some doubts regarding the fix. By one side, I would say that it should be the user using extruct the one with the responsibility to filter empty values (option 1). extruct should only expose the data in the proper order (appearance order), empty or not.
But by the other side, I have not seen anything in the protocol definition pointing that is possible to create properties with empty values, so removing all of them could be reasonable (option 2). But the question is, it is the same a document with property="og:description" content="" than a document without any annotation? I'm not sure 100%.
The solution of cleaning empty values but only if there is at least one non-empty value for the duplicated property seems like a strange mix (option 3). But maybe could do the job:
- User don't have to deal with empty values by himself, as they are removed
- Meaning could be retained in case of empty content
But is weird as the user still will have to deal with empty value for the particular case. It is a little bit strange.
So in conclusión, I have not a clear view about what we should do, the three options make some sense. The prefered one by my side would be the option 1, but I'm also ok with option 2 and 3. @croqaz @kmike @Gallaecio, what do you think?
I have the same concerns as Ivan. For OpenGraph we return a single value for an attribute, and empty values affect prioritization in case of multiple meta tags, so doing a correct prioritization (non-empty over empty) is fixing a bug.
Removing all empty values is a different thing; it solves the above mention prioritization problem, but it is a larger change, with unclear outcome - it can be a right thing to do (if having an empty value is the same as not having tag at all), or it can be a data loss (if a presence of a tag with empty value is meaningful information). To decide we need to answer this question, and I'm not sure how to do it :) My suggestion was a tool by Facebook (as standard says nothing), and it seems to confirm that empty value is meaningful - if I'm reading @croqaz's results properly, it does show an empty description if a description is present, but empty - and shows nothing if description is absent.