ckanext-dcat
Datasets not found on Google Dataset Search
Hi,
I am running CKAN 2.9.2 on Ubuntu 20 and have installed the DCAT plugin. I followed the instructions in the README (activating the structured_data and dcat plugins) so that my datasets would be discovered by Google Dataset Search, but this has not happened so far.
What could I be missing?
Best regards
Did you verify that the structured data is generated in the frontend (i.e. view the page source and check for a JSON-LD block)? Or have you perhaps customized your frontend?
Then you could check if the schema validator indicates any errors for your domain (test with the URL of a dataset).
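If viewing the source by hand gets tedious, the check can be scripted. Below is a minimal sketch using only the Python standard library: it pulls every script block of type application/ld+json out of a page and reports whether one of them describes a dataset. The names JsonLdExtractor, extract_jsonld and describes_dataset are made up for this example, and the list of accepted @type spellings is an assumption, since the exact output of the extension is not shown in this thread.

```python
import json
from html.parser import HTMLParser

# Assumption: a dataset node may spell its type in any of these ways.
DATASET_TYPES = {
    "Dataset",
    "schema:Dataset",
    "http://schema.org/Dataset",
    "https://schema.org/Dataset",
}


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    """Return a list with one parsed JSON value per JSON-LD block in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def describes_dataset(block):
    """Return True if the block, or a node in its @graph, is typed as a dataset."""
    nodes = block.get("@graph", [block]) if isinstance(block, dict) else block
    if isinstance(nodes, dict):
        nodes = [nodes]
    for node in nodes:
        if not isinstance(node, dict):
            continue
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if DATASET_TYPES.intersection(types):
            return True
    return False
```

To use it, download the HTML of one dataset page (for example with urllib.request.urlopen), pass the decoded text to extract_jsonld, and call describes_dataset on each returned block. An empty list means the page contains no JSON-LD block at all.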
Hi @metaodi, and thank you for your answer. The validator does not report any errors, and my URLs seem to be correct.
We also had issues getting datasets indexed by Google Dataset Search: only a few of them get indexed.
Maybe Google Dataset Search requires the standard JSON-LD structure for indexing: https://developers.google.com/search/docs/advanced/structured-data/dataset#example
@sagargg this is exactly what this extension provides. But it's hard to tell what went wrong with no further details.
- Is the JSON-LD block generated?
- Do you have a robots.txt?
- Does your site submit a sitemap to Google?
Thank you @metaodi for your answer.
The JSON-LD is correctly formed. As I have no prior experience with letting crawlers access a website, I was not aware that I needed to take care of a robots.txt file and a sitemap. I realize now that it is important to read the Google Search guidelines before using the extension. Are there CKAN-specific instructions for setting up a robots.txt and a sitemap?
No, there is nothing CKAN-specific. We use this extension on the open data catalogue of the City of Zurich, and it works for us.
See the Google Dataset Search help page for specific instructions: https://datasetsearch.research.google.com/help
Hope this helps.
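For the sitemap part, here is a rough sketch of how one could be generated, assuming your dataset pages live under /dataset/<name> (the default CKAN URL pattern) and that the dataset names are URL-safe slugs. The function name build_sitemap is made up for this example, and example.org is a placeholder. The list of names could come from the CKAN API action package_list.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(site_url, dataset_names):
    """Return a sitemap XML string with one <url> entry per dataset page.

    Assumes dataset pages are served at <site_url>/dataset/<name>.
    """
    base = site_url.rstrip("/")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for name in dataset_names:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base}/dataset/{name}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical dataset names, for illustration only.
    print(build_sitemap("https://example.org", ["air-quality", "tree-register"]))
```

The result can be saved as sitemap.xml, served from the site root, and submitted through Google Search Console or referenced from robots.txt with a Sitemap: line.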
Thanks! Is a robots.txt really needed? I thought that, when none is given, Google would just crawl everything.
No, it's not necessary. But since I don't know your setup, it could be that an existing robots.txt is blocking the Google crawler.
Just something to keep in mind.
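To rule this out, you can test a robots.txt against one of your dataset URLs. The sketch below uses urllib.robotparser from the Python standard library. The function name googlebot_allowed and the example.org URLs are placeholders, and Python's parser may not interpret every rule exactly the way Google's crawler does, so treat the result as a first hint only.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A rule set that would hide all dataset pages from every crawler.
    blocking = "User-agent: *\nDisallow: /dataset/\n"
    print(googlebot_allowed(blocking, "https://example.org/dataset/my-data"))
```

Paste the content of your own /robots.txt (if the URL returns one) into the first argument. An empty string, which corresponds to having no rules at all, allows everything.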
I see. I am not aware of any pre-existing robots.txt in my CKAN instance. Maybe if I explicitly add one, the indexing will work.