pygeoapi Embed jsonld snippet directly, without requesting with ajax

Goals:

- Google's rich snippets test does not parse the AJAX response (it is not clear whether the rich snippets test is fully representative of Google's structured data crawler)
- Better performance

This is a work in progress, currently implemented for /collections/foo and /collections/foo/items/faa

May 30 '22 21:05 pvgenuchten

The code uses copy.deepcopy to copy the content item before it is used for JSON-LD generation. This is needed because the content item appears to be altered during the JSON-LD conversion.
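A minimal sketch of why the deep copy matters. The `to_jsonld` function and the `content` dict are hypothetical stand-ins, not pygeoapi code; the converter is written to edit its input in place, including nested items, which is the behaviour the comment above suspects.

```python
import copy


def to_jsonld(record):
    """Hypothetical converter that edits its input in place."""
    record["@type"] = "Dataset"
    record["name"] = record.pop("title")
    for link in record["links"]:
        # Nested objects are modified too, so a shallow copy would not be enough.
        link["@type"] = "DataDownload"
    return record


content = {
    "id": "obs",
    "title": "Observations",
    "links": [{"href": "https://example.org/obs"}],
}

# Convert a deep copy so the HTML template still receives the untouched content.
jsonld = to_jsonld(copy.deepcopy(content))
```

With `copy.copy` instead, the top-level keys of `content` would survive but the entries in `links` would still be shared and modified.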

I'm using metadata.license directly from config, as fcmld is not available on html api class

May 30 '22 21:05 pvgenuchten

Not sure what the search engines are doing (so whether or not this is required for structured data parsing), but the idea of using AJAX to request it separately is already to improve performance by making the request for that data separate from the page load itself. That is, a human can see the page as quickly as possible, and then the search-engine-oriented data component can load in the background. One hopes a search engine is using conneg to preferentially load JSON-LD over HTML, or at least looking at rel=alternate or something.

Introducing a new way to get JSON-LD representation that is not via conneg seems a little worrying (e.g. "as fcmld is not available on html api class").

May 31 '22 01:05 alpha-beta-soup

Hi all,

Why not just use python machinery to embed the result of a URL request for the json-ld. I was able to do this pretty easily here. For the sake of proof of concept, I opted to use requests which I think breaks some things.

Jun 07 '22 20:06 webb-ben

> Why not just use python machinery to embed the result of a URL request for the json-ld.

No need to make an actual request; this PR does the same thing by calling the function that normally outputs the JSON-LD.

Jun 11 '22 19:06 pvgenuchten

> One hopes a search engine is using conneg to preferentially load JSON-LD over HTML, or at least looking at rel=alternate or something.

Unfortunately no: both Yandex and Google (I haven't tried Bing yet) seem not to use conneg or rel=alternate.

Jun 11 '22 19:06 pvgenuchten

> Introducing a new way to get JSON-LD representation that is not via conneg seems a little worrying (e.g. "as fcmld is not available on html api class").

Note that this PR does not add an HTTP mechanism to get a JSON-LD representation; it adds JSON-LD content to an existing HTTP mechanism. I do not know the consequences of "fcmld is not available on html api class".

Jun 11 '22 19:06 pvgenuchten

@pvgenuchten @alpha-beta-soup @webb-ben how well is XHR supported in search engine crawlers? This will help us determine the direction on whether to XHR or embed.

Jun 26 '23 14:06 tomkralidis

> Why not just use python machinery to embed the result of a URL request for the json-ld.

Yes, it would work, but I'd prefer not to add another HTTP request if we can manage fully on the Python side.

Jun 27 '23 05:06 pvgenuchten

XHR is supported by the google crawler

Think there is a bigger question at hand regarding best practice for SEO and crawlability (see https://github.com/geopython/pygeoapi/issues/877 and https://github.com/geopython/pygeoapi/pull/902). SEO is graded in part based on how long it takes for the initial page to load - to that extent the XHR implementation is beneficial.

Requesting the JSON-LD separately via XHR does increase CPU processing time, whereas this PR puts the already computed result into the JSON-LD script tag.
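As a rough illustration of placing an already computed JSON-LD result into a script tag, here is a minimal sketch using only the standard library. The function name `jsonld_script_tag` and the sample dict are assumptions for illustration; pygeoapi's actual templates may do this differently.

```python
import json


def jsonld_script_tag(jsonld):
    """Serialize an already computed JSON-LD dict into an inline script element."""
    payload = json.dumps(jsonld, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape for "/", so the payload still parses the same.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"


tag = jsonld_script_tag(
    {"@context": "https://schema.org/", "@type": "Dataset", "name": "obs"}
)
```

The escaping step matters when values such as titles or descriptions come from user-supplied configuration.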

Jun 28 '23 16:06 webb-ben

@webb-ben you indicate XHR is supported? However, my impression is that Google does not use XHR to ingest the JSON-LD payload for an HTML page, because in that scenario I'd expect the rich results test to identify a schema.org/Dataset on https://demo.pygeoapi.io/master/collections/obs?f=html, which it does not.

However, when I inject the JSON-LD block manually and run the test again, it works fine.
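For a quick local check of what is present in the static HTML, without executing any JavaScript, something like the following sketch can help. It says nothing about how a particular crawler behaves; the class and function names and the two sample pages are made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect JSON-LD blocks that appear directly in the HTML source."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            text = "".join(self._buffer).strip()
            if text:
                self.blocks.append(json.loads(text))
            self._in_jsonld = False


def static_jsonld(html):
    """Return the JSON-LD objects embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Illustrative pages: one with an embedded block, one that only fetches it via script.
embedded = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Dataset"}</script></head></html>'
)
ajax_only = '<html><head><script>fetch("?f=jsonld")</script></head></html>'
```

Here `static_jsonld(embedded)` finds one block, while `static_jsonld(ajax_only)` finds none, since the second page only carries the script that would request the data.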

Jul 02 '23 14:07 pvgenuchten

When I try to crawl https://demo.pygeoapi.io/master/collections/obs?f=html (ref), I get that it is blocked by robots.txt.

However, crawling https://reference.geoconnex.us/collections/nat_aq?f=html running 0.15.dev0 (ref) and crawling https://features.internetofwater.dev/collections/NPDES?f=html running 0.16.dev0 (ref) appear to work as expected

Jul 02 '23 15:07 webb-ben

What the Google Rich Results Test actually tests needs to be clarified in this conversation. For instance, I was not able to find any version of the pygeoapi landing page with detected rich results. So I went through the OGC API - Features hypermedia pattern:

| API Endpoint | Rich Result Test URL | Result |
| --- | --- | --- |
| `/` | Test URL 1 | ❌ |
| `/collections` | Test URL 2 | ✅ |
| `/collections/NPDES` | Test URL 3 | ✅ |
| `/collections/NPDES/items` | Test URL 4 | ❌ |
| `/collections/NPDES/items/04R10I001` | Test URL 5 | ❌ |

These all have valid structured data according to the schema.org validator that Google redirects to. The failed results are in part because some of these pages do not constitute an acceptable webpage type and are thus ignored by the crawler.

Jul 13 '23 17:07 webb-ben

Thanks @webb-ben, that comment helped me figure out what happens.

Apparently the demo's robots.txt contains:

Disallow: *f=json-ld*

so the JSON-LD is not read by Google, which explains why it works for you but not for us.
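To see which request paths a rule like this would block, here is a small sketch of wildcard matching in the style Google documents for robots.txt, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function name `rule_matches` and the example paths are illustrative assumptions; this is not a full robots.txt parser.

```python
import re


def rule_matches(pattern, path_and_query):
    """Check a robots.txt path pattern against a URL path plus query string."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each "*" becomes ".*"; everything else is matched literally.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # Rules apply from the start of the path, which re.match enforces.
    return re.match(regex, path_and_query) is not None


rule = "*f=json-ld*"
blocked = rule_matches(rule, "/collections/obs?f=json-ld")
allowed = not rule_matches(rule, "/collections/obs?f=html")
```

Under this rule the HTML page itself stays crawlable, but any request whose URL contains `f=json-ld` is disallowed, which is consistent with the structured data not being picked up on the demo.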

Nov 08 '23 13:11 pvgenuchten
