Embed jsonld snippet directly, without requesting with ajax

Open pvgenuchten opened this issue 3 years ago • 6 comments

Goal is to:

  • The Google rich snippets test does not parse the AJAX response (it is not clear whether that test is fully representative of Google's structured data crawler)
  • better performance

This is a work in progress, now implemented for /collections/foo and /collections/foo/items/faa

pvgenuchten avatar May 30 '22 21:05 pvgenuchten

The code uses copy.deepcopy to copy the content item, to be used for jsonld generation. This relates to the fact that the content item seems altered during jsonld conversion
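A minimal sketch of why the deep copy matters. The `to_jsonld` function here is a hypothetical stand-in that mutates its input; it is not pygeoapi's actual conversion code, and the field names are made up for illustration.

```python
import copy


def to_jsonld(record):
    # Hypothetical converter that rewrites its argument in place,
    # standing in for a JSON-LD conversion that alters the content item.
    record["@type"] = "schema:Dataset"
    record["schema:name"] = record.pop("title")
    return record


content = {"id": "obs", "title": "Observations"}

# Convert a deep copy so the dict used to render the HTML page stays intact.
jsonld = to_jsonld(copy.deepcopy(content))
```

Without the `copy.deepcopy` call, `content` would lose its `title` key before the HTML template gets to render it.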

I'm using metadata.license directly from config, as fcmld is not available on html api class

pvgenuchten avatar May 30 '22 21:05 pvgenuchten

Not sure what the search engines are doing (so whether or not this is required for structured data parsing), but the idea of using AJAX to request it separately is already to improve performance by making the request for that data separate from the page load itself. That is, a human can see the page as quickly as possible, and then the search-engine-oriented data component can load in the background. One hopes a search engine is using conneg to preferentially load JSON-LD over HTML, or at least looking at rel=alternate or something.

Introducing a new way to get JSON-LD representation that is not via conneg seems a little worrying (e.g. "as fcmld is not available on html api class").

alpha-beta-soup avatar May 31 '22 01:05 alpha-beta-soup

Hi all,

Why not just use python machinery to embed the result of a URL request for the json-ld? I was able to do this pretty easily here. For the sake of proof of concept, I opted to use requests, which I think breaks some things.

webb-ben avatar Jun 07 '22 20:06 webb-ben

> Why not just use python machinery to embed the result of a URL request for the json-ld.

No need to make an actual request, this PR actually does the same thing by calling the function that normally outputs the json-ld

pvgenuchten avatar Jun 11 '22 19:06 pvgenuchten

> One hopes a search engine is using conneg to preferentially load JSON-LD over HTML, or at least looking at rel=alternate or something.

unfortunately no, both yandex and ggl (haven't tried bing yet) seem to not use conneg or rel=alternate

pvgenuchten avatar Jun 11 '22 19:06 pvgenuchten

> Introducing a new way to get JSON-LD representation that is not via conneg seems a little worrying (e.g. "as fcmld is not available on html api class").

Note that this PR does not add a http mechanism to get json-ld representation, it adds json-ld content to an existing http mechanism. I do not know the consequences of "fcmld is not available on html api class"

pvgenuchten avatar Jun 11 '22 19:06 pvgenuchten

@pvgenuchten @alpha-beta-soup @webb-ben how well is XHR supported in search engine crawlers? This will help us determine the direction on whether to XHR or embed.

tomkralidis avatar Jun 26 '23 14:06 tomkralidis

> Why not just use python machinery to embed the result of a URL request for the json-ld.

Yes, it would work, but I’d prefer not to add another http request, if we can manage fully on python side

pvgenuchten avatar Jun 27 '23 05:06 pvgenuchten

XHR is supported by the Google crawler

image

Think there is a bigger question at hand regarding best practice for SEO and crawlability (see https://github.com/geopython/pygeoapi/issues/877 and https://github.com/geopython/pygeoapi/pull/902). SEO is graded in part based on how long it takes for the initial page to load - to that extent the XHR implementation is beneficial.

Separately requesting the JSON-LD via XHR request does increase the CPU processing time. This PR pumps the already computed result into the json-ld script tag.
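As a rough sketch of the embedding approach (not the PR's actual code), an already computed JSON-LD dict could be serialized into a script tag on the Python side. The helper name `jsonld_script_tag` and the sample document are assumptions for illustration.

```python
import json


def jsonld_script_tag(jsonld):
    # Serialize the already computed JSON-LD; no extra HTTP request is made.
    payload = json.dumps(jsonld, indent=2)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape for "/", so the payload still parses.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


doc = {"@context": "https://schema.org/", "@type": "Dataset", "name": "Observations"}
tag = jsonld_script_tag(doc)
```

The resulting string can be placed in the page head by the HTML template, so a crawler that does not execute XHR still sees the structured data.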

webb-ben avatar Jun 28 '23 16:06 webb-ben

@webb-ben you indicate xhr is supported? however my impression is that XHR is not supported by google to ingest the specific json-ld payload embedded in html, because in that scenario i'd expect the rich results test to identify a schema.org/dataset on https://demo.pygeoapi.io/master/collections/obs?f=html, which it does not:

image

However when i inject the jsonld block manually, and run the test again, it works fine

image

pvgenuchten avatar Jul 02 '23 14:07 pvgenuchten

When I try to crawl https://demo.pygeoapi.io/master/collections/obs?f=html (ref), I get that it is blocked by robots.txt.

However, crawling https://reference.geoconnex.us/collections/nat_aq?f=html running 0.15.dev0 (ref) and crawling https://features.internetofwater.dev/collections/NPDES?f=html running 0.16.dev0 (ref) appear to work as expected

webb-ben avatar Jul 02 '23 15:07 webb-ben

The role of the Google Rich Result test needs to be better clarified in this conversation. For instance, I was not able to find a version of the pygeoapi landing page with any detected rich results. So I went through the OGC API feature hypermedia pattern:

| API Endpoint | Rich Result Test URL | Result |
|---|---|---|
| / | Test URL 1 | |
| /collections | Test URL 2 | |
| /collections/NPDES | Test URL 3 | |
| /collections/NPDES/items | Test URL 4 | |
| /collections/NPDES/items/04R10I001 | Test URL 5 | |

These all have valid structured data in the schema.org validator that Google provides a redirect to. That is in part because some of these pages do not constitute an acceptable webpage type and thus are ignored by the crawler.

webb-ben avatar Jul 13 '23 17:07 webb-ben

Thanks @webb-ben, that comment helped me figure out what is happening

apparently we have this in robots.txt on demo

Disallow: *f=json-ld*

so the json ld is not read by google, it explains why it works for you, not us
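To illustrate why that rule blocks the JSON-LD request but not the HTML page, here is a simplified, hand-rolled matcher. It assumes the wildcard semantics search engines commonly document for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end); it ignores user-agent groups and Allow/Disallow precedence, and `robots_rule_matches` is a made-up name.

```python
import re


def robots_rule_matches(pattern, url_path):
    # A trailing "$" anchors the rule to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each "*" becomes ".*"; every other character is matched literally.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # Rules are matched from the start of the path plus query string.
    return re.match(regex, url_path) is not None


blocked = robots_rule_matches("*f=json-ld*", "/master/collections/obs?f=json-ld")
html_ok = robots_rule_matches("*f=json-ld*", "/master/collections/obs?f=html")
```

Under these assumptions the `f=json-ld` URL matches the Disallow rule while the `f=html` URL does not, so a crawler can load the page but not the JSON-LD fetched by XHR.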

pvgenuchten avatar Nov 08 '23 13:11 pvgenuchten