No robots.txt, sitemap.xml in the web root
Issue Type
- [x] 🐛 Bug / Problem
- [ ] ✏️ Typo / Grammar
- [ ] 📖 Outdated Content
- [ ] 🚀 Enhancement
Generated by Generative AI
No response
Distribution
No response
Description
I see that `make multiversion` generates a sitemap.xml for every version of the documentation; however, these XML files 404 on the deployed web version. Is someone manually telling Google to index pages? Why are there no robots.txt and sitemap.xml visible on the web version?
The deployed web version 404s on the following (a quick reproduction check is sketched below):
- https://docs.ros.org/sitemap.xml -> this one actually is generated in `build/html`
- https://docs.ros.org/robots.txt -> this one doesn't get generated
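For reference, the 404s can be reproduced with a small script like the following (nothing assumed beyond the two URLs above):

```python
# Minimal check of the two URLs listed above; prints the HTTP status for each.
import urllib.error
import urllib.request

URLS = (
    "https://docs.ros.org/sitemap.xml",
    "https://docs.ros.org/robots.txt",
)

for url in URLS:
    try:
        with urllib.request.urlopen(url) as resp:
            print(url, "->", resp.status)
    except urllib.error.HTTPError as err:
        print(url, "->", err.code)  # currently 404 for both
```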
This is related to the discussion here: https://discourse.openrobotics.org/t/discoverability-of-documentation-on-search-engines/51059
https://github.com/ros-infrastructure/rosindex/issues/552 was also mentioned there. I also noticed that index.ros.org has a proper /robots.txt and sitemap.xml.
Another related problem is that the generated sitemap.xml files for each version don't include all of the automatically generated documents hosted on docs.ros.org, such as https://docs.ros.org/en/humble/p/tf2/, so I wonder whether those pages are being indexed by Google at all.
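As a rough way to verify that locally, the generated sitemap can be parsed and searched for a given page. This is only a sketch: the `build/html/sitemap.xml` path and the tf2 URL are the examples mentioned above, so adjust them for your checkout and version.

```python
# Sketch: check whether a given page URL appears in a locally generated sitemap.
# Assumes the standard sitemap XML namespace; path and page URL are examples from above.
import xml.etree.ElementTree as ET

SITEMAP_PATH = "build/html/sitemap.xml"
PAGE_URL = "https://docs.ros.org/en/humble/p/tf2/"

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.parse(SITEMAP_PATH).getroot()
urls = {loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text}

print(f"{len(urls)} URLs in sitemap")
print(f"{PAGE_URL} listed: {PAGE_URL in urls}")
```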
Affected Pages/Sections
No response
Screenshots or Examples (if applicable)
No response
Suggested Fix
I think there's a problem with the web server's deployment configuration that prevents sitemap.xml from being reached. In addition, a robots.txt pointing to the sitemap index would be helpful for search engines as well.
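Something along these lines at the web root would be enough (the sitemap URL is the one from above; if a separate sitemap index exists, point at that instead):

```
# Hypothetical robots.txt for docs.ros.org: allow everything, advertise the sitemap.
User-agent: *
Disallow:

Sitemap: https://docs.ros.org/sitemap.xml
```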
I'm not sure how the package-specific docs are generated, but those should also be included in the sitemap.
Additional Context
No response
This issue has been mentioned on Open Robotics Discourse. There might be relevant details there:
https://discourse.openrobotics.org/t/discoverability-of-documentation-on-search-engines/51059/7
If someone has access to the Google Search Console and could post some screenshots of what Google actually sees when it looks at docs.ros.org, that would also be helpful in diagnosing why a lot of ROS documentation results are not available on Google.
Thanks for the suggestion.
I am not sure if we have Google Search Console set up in the first place. It looks like we would need to enable it first and then figure out how we want to manage access. Let me ping the infra team and see what we can do. We've also got some internal efforts happening, and I want to make sure we don't step on those toes.
I've been watching it for index.ros.org but don't have docs.ros.org in my view. I'll reach out to OSRF to see if I can take a look at this.
From triage meeting: assigning to @gbiggs for delegation to the right person
@gbiggs got me access to the search console. The multiversion sitemaps are all being indexed successfully.
robots.txt is a blocking mechanism, and the lack of one won't prevent pages from being indexed.
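For illustration, crawling is only restricted when the file contains Disallow rules; an absent or empty robots.txt allows everything:

```
# This would tell crawlers to skip the whole site; nothing like it is in play here.
User-agent: *
Disallow: /
```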
I'm going to close this since the generated sitemaps are working fine, but I'll open some new issues based on other insights I've noticed in the console.