pygeoapi
Embed jsonld snippet directly, without requesting with ajax
Goals:
- make the JSON-LD parseable by the Google Rich Results test, which does not parse the AJAX response (it is not clear whether the Rich Results test is fully representative of the Google structured data crawler)
- better performance
This is a work in progress, currently implemented for /collections/foo and /collections/foo/items/faa.
The code uses copy.deepcopy to copy the content item used for JSON-LD generation; this is because the content item appears to be altered during the JSON-LD conversion.
I'm using metadata.license directly from the config, as fcmld is not available on the HTML API class.
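Roughly, the idea looks like the sketch below; the converter function and config keys are simplified placeholders rather than the exact pygeoapi internals:

```python
import copy
import json

def to_jsonld(data: dict) -> dict:
    # stand-in for the real conversion, which appears to mutate its input
    data['@context'] = 'https://schema.org/'
    data['@type'] = 'Dataset'
    return data

def render_with_jsonld(content: dict, config: dict) -> dict:
    # deep-copy so the HTML rendering still sees the unmodified content item
    jsonld_doc = to_jsonld(copy.deepcopy(content))

    # take the license directly from the config, since fcmld is not
    # available on the HTML path (hypothetical config keys shown)
    jsonld_doc['license'] = config['metadata']['license']['url']

    # stash the serialized JSON-LD for the HTML template to embed
    content['jsonld'] = json.dumps(jsonld_doc)
    return content
```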
I'm not sure what the search engines are doing (and so whether or not this is required for structured data parsing), but the idea of using AJAX to request it separately was already to improve performance, by making the request for that data separate from the page load itself. That is, a human can see the page as quickly as possible, and the search-engine-oriented data component can then load in the background. One hopes a search engine is using conneg to preferentially load JSON-LD over HTML, or at least looking at rel=alternate or something.
Introducing a new way to get JSON-LD representation that is not via conneg seems a little worrying (e.g. "as fcmld is not available on html api class").
Hi all,
Why not just use Python machinery to embed the result of a URL request for the JSON-LD? I was able to do this pretty easily here. For the sake of proof of concept, I opted to use requests, which I think breaks some things.
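For reference, a rough sketch of that requests-based proof of concept; the base URL, format parameter, and embedding step are assumptions, not the exact code used:

```python
import requests

def fetch_jsonld_snippet(base_url: str, collection: str) -> str:
    # proof-of-concept only: request the JSON-LD representation over HTTP
    # and return it for embedding in the HTML template
    resp = requests.get(f'{base_url}/collections/{collection}',
                        params={'f': 'jsonld'}, timeout=5)
    resp.raise_for_status()
    return resp.text

# example call against the demo server
# snippet = fetch_jsonld_snippet('https://demo.pygeoapi.io/master', 'obs')
```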
Why not just use Python machinery to embed the result of a URL request for the JSON-LD?
There's no need to make an actual request; this PR does the same thing by calling the function that normally outputs the JSON-LD.
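In other words, something along these lines; the method name here is a placeholder for whatever the API class actually exposes:

```python
import json

def build_html_context(api, request, dataset: str, content: dict) -> dict:
    # call the same code path that serves ?f=jsonld, but in-process,
    # instead of issuing a second HTTP request
    jsonld_doc = api.get_collection_jsonld(request, dataset)  # hypothetical name
    content['jsonld'] = json.dumps(jsonld_doc)
    return content
```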
One hopes a search engine is using conneg to preferentially load JSON-LD over HTML, or at least looking at rel=alternate or something.
Unfortunately no; both Yandex and Google (I haven't tried Bing yet) seem not to use conneg or rel=alternate.
Introducing a new way to get JSON-LD representation that is not via conneg seems a little worrying (e.g. "as fcmld is not available on html api class").
Note that this PR does not add an HTTP mechanism to get the JSON-LD representation; it adds JSON-LD content to an existing HTTP mechanism. I do not know the consequences of "fcmld is not available on the HTML API class".
@pvgenuchten @alpha-beta-soup @webb-ben how well is XHR supported in search engine crawlers? This will help us determine the direction on whether to XHR or embed.
Why not just use Python machinery to embed the result of a URL request for the JSON-LD?
Yes, it would work, but I'd prefer not to add another HTTP request if we can manage it fully on the Python side.
XHR is supported by the Google crawler.
I think there is a bigger question at hand regarding best practice for SEO and crawlability (see https://github.com/geopython/pygeoapi/issues/877 and https://github.com/geopython/pygeoapi/pull/902). SEO is graded in part on how long the initial page takes to load; to that extent the XHR implementation is beneficial.
Separately requesting the JSON-LD via XHR does increase the CPU processing time. This PR pumps the already-computed result into the JSON-LD script tag.
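As a rough illustration of that embedding, the template side could look like the sketch below; the `jsonld` variable name is an assumption about the template context, not necessarily what the pygeoapi templates use:

```python
import json
from jinja2 import Template

# minimal template fragment; the real pygeoapi templates are more elaborate
TEMPLATE = Template(
    '<script type="application/ld+json">{{ jsonld | safe }}</script>'
)

jsonld = json.dumps({
    '@context': 'https://schema.org/',
    '@type': 'Dataset',
    'name': 'obs',  # example collection name
})

print(TEMPLATE.render(jsonld=jsonld))
```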
@webb-ben you indicate XHR is supported? However, my impression is that Google does not use XHR to ingest the JSON-LD payload that would otherwise be embedded in the HTML, because in that scenario I'd expect the Rich Results test to identify a schema.org/dataset on https://demo.pygeoapi.io/master/collections/obs?f=html, which it does not.
However, when I inject the JSON-LD block manually and run the test again, it works fine.
When I try to crawl https://demo.pygeoapi.io/master/collections/obs?f=html (ref), it is reported as blocked by robots.txt.
However, crawling https://reference.geoconnex.us/collections/nat_aq?f=html running 0.15.dev0 (ref) and https://features.internetofwater.dev/collections/NPDES?f=html running 0.16.dev0 (ref) both appear to work as expected.
The Google Rich Results test itself needs to be better clarified in this conversation. For instance, I was not able to find a version of the pygeoapi landing page with any detected rich results. So I went through the OGC API - Features hypermedia pattern:
| API Endpoint | Rich Result Test URL | Result |
|---|---|---|
| / | Test URL 1 | ❌ |
| /collections | Test URL 2 | ✅ |
| /collections/NPDES | Test URL 3 | ✅ |
| /collections/NPDES/items | Test URL 4 | ❌ |
| /collections/NPDES/items/04R10I001 | Test URL 5 | ❌ |
All of these have valid structured data according to the schema.org validator that Google provides a redirect to. The failures are in part because some of these pages do not constitute an acceptable webpage type and are thus ignored by the crawler.
Thanks @webb-ben, that comment helped me figure out what is happening.
Apparently on the demo we have
Disallow: *f=json-ld*
so the JSON-LD is not read by Google; that explains why it works for you but not for us.
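For anyone who wants to check their own deployment, a quick sketch (the robots.txt URL is just an example):

```python
import requests

# fetch the demo server's robots.txt (example URL) and flag any rule
# that would block the JSON-LD representation
robots = requests.get('https://demo.pygeoapi.io/robots.txt', timeout=5).text

for line in robots.splitlines():
    if line.strip().lower().startswith('disallow') and 'json-ld' in line:
        print('JSON-LD representation appears blocked:', line.strip())
```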