extruct
The emitter opengraph produces rigid, fragile structures
Consider this HTML construct:
<meta property="og:title" content="‘A Path That Is Not Sustainable’: College Hits Breaking Point, Sends Home Majority of Student Body" />
This is turned into the following by the opengraph module:
[
"og:title",
"‘A Path That Is Not Sustainable’: College Hits Breaking Point, Sends Home Majority of Student Body"
],
Does anyone know why this couldn't instead be:
"og:title":"‘A Path That Is Not Sustainable’: College Hits Breaking Point, Sends Home Majority of Student Body",
A case for having the output be this way:
- You can then use semantic selectors instead of positional selectors (["og:title"] vs [1]).
- If there are multiple occurrences with an og:tag then I think this is a bug, certainly for the purposes of extruct. Or maybe I'm wrong on this.
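To make the selector point concrete, here is a minimal sketch in plain Python. It assumes only the pair shape shown above; the titles and the `og:type` entry are placeholder values, and nothing here calls extruct itself.

```python
# Default-style output: a list of [property, content] pairs, as in the issue.
pairs = [
    ["og:title", "Example title"],
    ["og:type", "article"],
]

# Positional access: correct only while og:title stays the first tag on the page.
title_by_position = pairs[0][1]

# Semantic access: look the value up by property name instead.
# Note that dict() keeps only the last value when a property repeats.
as_dict = dict(pairs)
title_by_name = as_dict["og:title"]

print(title_by_position, title_by_name)
```

The positional form breaks as soon as a page reorders its meta tags, while the lookup by name does not.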
Thoughts most welcome.
@boxabirds that's a good question. The output with uniform=True is already a dict like you describe: https://github.com/scrapinghub/extruct#uniform - so my understanding is that by default we do minimal post-processing, but also provide the data in a more convenient format.
uniform=True, great, I'll try that, thanks!
On Wed, 9 Sep 2020 at 07:03, Konstantin Lopuhin wrote:
@boxabirds that's a good question. The output with uniform=True is already a dict like you describe: https://github.com/scrapinghub/extruct#uniform - so my understanding is that by default we do minimal post-processing, but also provide the data in a more convenient format.
So maybe we could update the docs to emphasise that the uniform=True flag should be used in most new cases. I'd encourage making this the default, but that would likely break things for your current users.
If there are multiple occurrences with an og:tag then I think this is a bug, certainly for the purposes of extruct. Or maybe I'm wrong on this.
OpenGraph has arrays and elements where order between tags matters: https://ogp.me/#array Is this covered?
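As a rough illustration of why repeated tags matter, the sketch below groups repeated properties into ordered lists instead of collapsing them. `group_og_properties` is a hypothetical helper written for this thread, not part of extruct, and the URLs are placeholders.

```python
from collections import defaultdict


def group_og_properties(pairs):
    """Group [property, content] pairs by property, keeping page order."""
    grouped = defaultdict(list)
    for prop, content in pairs:
        grouped[prop].append(content)
    return dict(grouped)


pairs = [
    ["og:title", "Example title"],
    ["og:image", "https://example.com/rock.jpg"],
    ["og:image", "https://example.com/rock2.jpg"],
]

grouped = group_og_properties(pairs)  # every og:image is kept, in order
naive = dict(pairs)                   # only the last og:image survives

print(grouped["og:image"])
print(naive["og:image"])
```

This does not model structured properties such as og:image:width, which per ogp.me belong to the root tag that precedes them, so a flat dict cannot represent them faithfully either.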