The emitter opengraph produces rigid, fragile structures

Open boxabirds opened this issue 3 years ago • 4 comments

Consider this HTML construct:

<meta property="og:title" content="‘A Path That Is Not Sustainable’: College Hits Breaking Point, Sends Home Majority of Student Body" />

This is turned into the following by the opengraph module:

    [
      "og:title",
      "‘A Path That Is Not Sustainable’: College Hits Breaking Point, Sends Home Majority of Student Body"
    ],

Does anyone know why this couldn't instead be:

"og:title":"‘A Path That Is Not Sustainable’: College Hits Breaking Point, Sends Home Majority of Student Body",

A case for having the output be this way:

  1. You can then do semantic selectors instead of positional selectors (["og:title"] vs [1])
  2. If there are multiple occurrences with an og:tag then I think this is a bug, certainly for the purposes of extruct. Or maybe I'm wrong on this.

Thoughts most welcome.

boxabirds avatar Sep 08 '20 21:09 boxabirds

@boxabirds that's a good question. The output with uniform=True is already a dict like you describe: https://github.com/scrapinghub/extruct#uniform - so my understanding is that by default we do minimal post-processing, but also provide the data in a more convenient format.

lopuhin avatar Sep 09 '20 06:09 lopuhin
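To make the difference between the two shapes concrete, here is a minimal sketch in plain Python. It does not call extruct and is not its implementation; `pairs_to_dict` is a hypothetical helper, and keeping the first value for a repeated key is an assumption made for this illustration only. It shows why a dict allows selection by name where the pair list needs a position.

```python
from typing import Dict, List


def pairs_to_dict(pairs: List[List[str]]) -> Dict[str, str]:
    """Collapse [key, value] pairs into a dict.

    Hypothetical helper for illustration: if a key repeats, the first
    value is kept (an assumption, not extruct's documented behaviour).
    """
    result: Dict[str, str] = {}
    for key, value in pairs:
        result.setdefault(key, value)
    return result


# Pair-list shape, as in the default output quoted in the issue.
properties = [
    ["og:title", "A Path That Is Not Sustainable"],
    ["og:type", "article"],
]

# Positional selection: relies on og:title being the first pair.
by_position = properties[0][1]

# Semantic selection: independent of tag order.
flat = pairs_to_dict(properties)
by_name = flat["og:title"]
```

The positional lookup breaks as soon as a page emits its tags in a different order, while the lookup by name does not.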

uniform=True, great, I’ll try that. Thanks!

boxabirds avatar Sep 09 '20 06:09 boxabirds

So maybe we could update the docs to emphasise that uniform=True should be used in most new cases. I'd encourage making it the default, but that would likely break your current users.

boxabirds avatar Sep 09 '20 10:09 boxabirds

> If there are multiple occurrences with an og:tag then I think this is a bug, certainly for the purposes of extruct. Or maybe I'm wrong on this.

OpenGraph has arrays, and elements where the order between tags matters (https://ogp.me/#array). Is this covered?

sknebel avatar Aug 02 '21 17:08 sknebel
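The ordering concern can be illustrated with a small sketch, again in plain Python and independent of extruct. The Open Graph protocol page describes structured properties such as og:image:width as belonging to the root tag declared before them, so a flat dict keyed by tag name would lose that association. `collect_structured` is a hypothetical helper and the URLs are placeholders.

```python
from typing import Dict, List


def collect_structured(pairs: List[List[str]], root: str) -> List[Dict[str, str]]:
    """Group ordered [key, value] pairs into one dict per `root` tag.

    Each occurrence of `root` starts a new item; any following
    `root:*` property is attached to the most recent item.
    Hypothetical helper for illustration only.
    """
    items: List[Dict[str, str]] = []
    prefix = root + ":"
    for key, value in pairs:
        if key == root:
            items.append({root: value})
        elif key.startswith(prefix) and items:
            items[-1][key] = value
    return items


# Two images, each with one structured property (placeholder URLs).
pairs = [
    ["og:image", "https://example.com/rock.jpg"],
    ["og:image:width", "300"],
    ["og:image", "https://example.com/rock2.jpg"],
    ["og:image:height", "1000"],
]

images = collect_structured(pairs, "og:image")
```

Here `images` holds two entries, and each width or height stays with the image it followed in the page, which is the information a single flat mapping cannot carry.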