ckanext-dcat Datasets not found on Google Dataset Search

Hi,

I am running CKAN 2.9.2 on Ubuntu 20 and I installed the DCAT plugin. I followed the instructions in the README file (activating the structured_data and dcat plugins) so that my datasets would be discovered by Google Dataset Search, but this has not happened so far.

What could I be missing?

Best regards

Aug 03 '21 17:08 ghost

Did you verify that the structured data is generated in the frontend (i.e. view the page source and check for a JSON-LD block, a script tag of type application/ld+json)? Maybe you have customized your frontend?

Then you could check if the schema validator indicates any errors for your domain (test with the URL of a dataset).
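The "view source" check can also be scripted. The following is only a minimal sketch using the Python standard library: it pulls every application/ld+json script block out of a page and looks for nodes typed as a schema.org Dataset. The sample HTML, the helper names and the list of accepted type spellings are assumptions made for illustration, not something taken from the extension.

```python
import json
from html.parser import HTMLParser

# Hypothetical page source; replace with the HTML of one of your dataset pages.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": {"schema": "http://schema.org/"},
 "@graph": [{"@type": "schema:Dataset", "schema:name": "Example dataset"},
            {"@type": "schema:Organization", "schema:name": "Example org"}]}
</script>
</head><body></body></html>
"""

# Assumed spellings of the Dataset type; adjust to what your pages emit.
DATASET_TYPES = {
    "Dataset",
    "schema:Dataset",
    "http://schema.org/Dataset",
    "https://schema.org/Dataset",
}


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of application/ld+json script blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            # json.loads raises ValueError if the block is not valid JSON.
            self.blocks.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    """Return a list with one parsed JSON value per JSON-LD block."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def find_datasets(blocks):
    """Return all nodes whose @type is one of DATASET_TYPES."""
    found = []
    for block in blocks:
        nodes = block if isinstance(block, list) else [block]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            # Nodes may be wrapped in an @graph list or stand alone.
            for item in node.get("@graph", [node]):
                if not isinstance(item, dict):
                    continue
                types = item.get("@type", [])
                if isinstance(types, str):
                    types = [types]
                if any(t in DATASET_TYPES for t in types):
                    found.append(item)
    return found


datasets = find_datasets(extract_jsonld(SAMPLE_HTML))
print(f"Dataset nodes found: {len(datasets)}")
```

To run it against a live page, the HTML could be fetched with urllib.request.urlopen and passed to extract_jsonld. An empty result would suggest that the template block rendering the structured data is missing from the page.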

Nov 05 '21 16:11 metaodi

Hi @metaodi and thank you for your answer. The validator does not indicate any error and it seems my urls are correct.

Nov 24 '21 10:11 maxclac

We also had some issues with Google Dataset Search indexing our datasets. Only a few of them get indexed.

Jan 24 '22 07:01 anuveyatsu

Maybe Google Dataset Search requires a standard JSON-LD structure for indexing: https://developers.google.com/search/docs/advanced/structured-data/dataset#example

Feb 10 '22 06:02 sagargg

@sagargg this is exactly what this extension provides. But it's hard to tell what went wrong with no further details.

Is the JSON-LD block generated?
Do you have a robots.txt?
Does your site submit a sitemap to Google?

Feb 10 '22 08:02 metaodi

Thank you @metaodi for your answer.

The JSON-LD is correctly formed. As I have no prior experience with letting crawlers access a website, I was not aware that I needed to take care of a robots.txt file and a sitemap. I realized it is important to read the Google Search guidelines before using the extension. Are there CKAN-specific instructions for setting up a robots.txt and a sitemap?

Feb 10 '22 08:02 maxclac

No, there is nothing CKAN-specific. We use this extension on the open data catalogue of the City of Zurich, and it works for us.

See the Google Dataset Search help page for specific instructions: https://datasetsearch.research.google.com/help
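Since nothing CKAN-specific exists, a sitemap could be generated by hand. The sketch below is one possible way, assuming the default CKAN URL pattern /dataset/<name> and a list of dataset names obtained elsewhere (for example from the package_list API action). The site URL, the dataset names and the function name are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(site_url, dataset_names):
    """Return a sitemap XML string with one <url> entry per dataset page."""
    base = site_url.rstrip("/")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for name in dataset_names:
        url = ET.SubElement(urlset, "url")
        # Assumes the default CKAN dataset route: /dataset/<name>
        ET.SubElement(url, "loc").text = f"{base}/dataset/{name}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder values; replace with your own site and dataset names.
sitemap_xml = build_sitemap(
    "https://data.example.org/",
    ["my-dataset", "another-dataset"],
)
print(sitemap_xml)
```

The resulting file would typically be served as /sitemap.xml and submitted through Google Search Console or referenced from robots.txt with a Sitemap: line.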

Hope this helps.

Feb 10 '22 09:02 metaodi

Thanks! Is a robots.txt really needed? I thought that, when none is given, Google would just crawl everything.

Feb 10 '22 11:02 maxclac

No, it's not necessary. But since I don't know your setup, it could be that an existing robots.txt is blocking the Google crawler.

Just something to keep in mind.
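Whether a given robots.txt blocks the crawler can be checked with Python's standard urllib.robotparser. This is a rough sketch only: the two robots.txt snippets, the dataset URL and the helper name are made up for illustration, and Python's parser may differ from Googlebot's matching in edge cases such as wildcards.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt contents and dataset URL.
BLOCKING = "User-agent: *\nDisallow: /dataset/\n"
PERMISSIVE = "User-agent: *\nDisallow: /api/\n"
DATASET_URL = "https://data.example.org/dataset/my-dataset"

print("blocking rules allow crawl:", is_allowed(BLOCKING, DATASET_URL))
print("permissive rules allow crawl:", is_allowed(PERMISSIVE, DATASET_URL))
```

For a live site, RobotFileParser can load the file itself: call set_url with the address of your robots.txt, then read(), then can_fetch as above.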

Feb 10 '22 11:02 metaodi

I see. I am not aware of any pre-existing robots.txt in my CKAN instance. Maybe if I explicitly add one, the indexing will work.

Feb 10 '22 11:02 maxclac
