PHPScraper

Parsing structured data (ld+json)

Open spekulatius opened this issue 5 years ago • 4 comments

It would make sense to parse the structured-data JSON (ld+json) that some sites provide within the head tag. This way, the information already read from the meta tags could be made more robust and possibly extended later on.

Ref: https://developers.google.com/search/docs/data-types/article

spekulatius avatar Aug 24 '20 12:08 spekulatius

Context: https://json-ld.org/

spekulatius avatar Aug 27 '20 11:08 spekulatius

Some thoughts:

A website can contain multiple JSON-LD blocks. It seems possible to combine them (https://stackoverflow.com/a/48295719); we should probably use the array notation:

[
  {
     "@context": "http://schema.org",
     "@type": "Organization"
  },
  {
     "@context": "http://schema.org",
     "@type": "BreadcrumbList"
  }
]

Would it make sense to always return an array, even if the page contains only one JSON-LD block? (Probably yes.)
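To illustrate the "always return an array" idea, here is a minimal sketch. The helper name `jsonLdBlockToList` is hypothetical and not part of PHPScraper; it only assumes that each block's text is available as a string.

```php
<?php

/**
 * Hypothetical helper (not part of PHPScraper): decode one ld+json block
 * and always return a list of nodes, even if the block holds one object.
 */
function jsonLdBlockToList(string $json): array
{
    $decoded = json_decode($json, true);

    // Invalid JSON or a scalar value: nothing usable in this block.
    if (!is_array($decoded)) {
        return [];
    }

    // A top-level JSON array (the array notation) is already a list.
    if (array_values($decoded) === $decoded) {
        return $decoded;
    }

    // A single top-level object is wrapped so callers always get a list.
    return [$decoded];
}

$single = '{"@context": "http://schema.org", "@type": "Organization"}';
$multiple = '[{"@context": "http://schema.org", "@type": "Organization"},'
    . ' {"@context": "http://schema.org", "@type": "BreadcrumbList"}]';

$fromSingle = jsonLdBlockToList($single);     // list with one node
$fromMultiple = jsonLdBlockToList($multiple); // list with two nodes
```

With this shape, calling code can loop over the result without checking whether the page had one block or several.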

eposjk avatar Dec 08 '22 23:12 eposjk

Hey @eposjk,

good point on the multiple ld+json blocks.

Yeah, if data exists in multiple positions we should go for an array. It might be only one element, but at least it's future-proof. Merging blocks into one might be an option too.
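For the merging option, one possibility is to collect the nodes under a single `@graph` key. This is only a sketch: `mergeJsonLdNodes` is a hypothetical name, and it assumes all nodes share the same `@context`, which will not hold for every page.

```php
<?php

/**
 * Hypothetical helper (not part of PHPScraper): combine decoded JSON-LD
 * nodes into one document using "@graph".
 * Assumption: every node uses the same "@context" as $context.
 */
function mergeJsonLdNodes(array $nodes, string $context = 'http://schema.org'): array
{
    $graph = [];
    foreach ($nodes as $node) {
        // The shared context moves to the top level of the merged document.
        unset($node['@context']);
        $graph[] = $node;
    }

    return ['@context' => $context, '@graph' => $graph];
}

$merged = mergeJsonLdNodes([
    ['@context' => 'http://schema.org', '@type' => 'Organization'],
    ['@context' => 'http://schema.org', '@type' => 'BreadcrumbList'],
]);
```

Returning a plain list keeps each block's own `@context` intact, so the merged form is mainly useful when the contexts are known to match.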

Cheers, Peter

spekulatius avatar Dec 09 '22 09:12 spekulatius

This is what I'm using:

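        // Assumes $dom is a DOMDocument with the page's HTML already loaded.
        $jsonLd = [];
        foreach ($dom->getElementsByTagName('script') as $script) {
            if ($script->getAttribute('type') === 'application/ld+json') {
                // Strip /* ... */ comments; the "s" flag lets them span lines.
                $json_txt = preg_replace('@/\*.*?\*/@s', '', $script->textContent);
                $json_txt = preg_replace("/\r|\n/", ' ', trim($json_txt));
                $schema = json_decode($json_txt, true);
                if (!is_array($schema)) {
                    continue; // skip blocks that do not decode to an array
                }
                if (isset($schema['@graph']) && is_array($schema['@graph'])) {
                    // array_merge() appends; "+=" keeps existing numeric keys
                    // and would drop nodes from a second @graph block.
                    $jsonLd = array_merge($jsonLd, $schema['@graph']);
                } else {
                    $jsonLd[] = $schema;
                }
            }
        }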

joshua-bn avatar Jan 19 '23 04:01 joshua-bn