dspace-angular
Avoid excess load of bots going into search facet links on entity pages
Describe the bug We're seeing in Search Console for several of our clients that bots follow facet links on entity pages. This doesn't contribute to the quality of the indexing (bots shouldn't be going there), and processing these requests is resource-intensive, so we'd better avoid this behaviour altogether.
To Reproduce Steps to reproduce the behavior:
- Look at Search Console for an actively indexed DSpace 7 site that has entities enabled
- Look in the reports of crawled URLs for patterns like:
entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang
Expected behavior Robots should be blocked from doing this
Proposed solution Add the following disallow directive to robots.txt:
Disallow: /entities/*?f
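To sanity-check which URLs this directive would cover, here is a minimal TypeScript sketch of wildcard matching, assuming the commonly documented semantics where * matches any run of characters, a trailing $ anchors the end, and a pattern otherwise matches as a prefix of path plus query string. The function names robotsPatternToRegExp and isDisallowed are hypothetical and not part of dspace-angular; real crawlers may differ in details such as percent-encoding.

```typescript
// Hypothetical helper, not part of dspace-angular: converts a robots.txt
// path pattern into a RegExp under the assumed semantics
// ("*" = any characters, trailing "$" = end of URL, otherwise prefix match).
function robotsPatternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split('*')
    // Escape regex metacharacters in the literal parts ("?" included).
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + (anchored ? '$' : ''));
}

// Returns true when the path plus query string matches the Disallow pattern.
function isDisallowed(pattern: string, pathAndQuery: string): boolean {
  return robotsPatternToRegExp(pattern).test(pathAndQuery);
}

const rule = '/entities/*?f';
console.log(isDisallowed(rule, '/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang'));
console.log(isDisallowed(rule, '/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e'));
```

Under these assumptions the rule only matches when a parameter starting with f comes first in the query string. A URL such as /entities/orgunit/x?spc.page=2&f.author=Wang would not be covered by this sketch.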
Related work
Previously incorrectly created in the back-end Git repo as https://github.com/DSpace/DSpace/issues/9227
Thanks @bram-atmire! I can imagine this is a huge load (like crawling search and browse as well) and an obvious win for bots that respect robots.txt. I'm wondering if Google's interpretation of the robot exclusion protocol supports wildcards such as this after path elements. It seems maybe? Have you tried it on a live site?
As a sysadmin I'd block these patterns in Apache / nginx just to be sure—as the Russian saying goes: "trust, but verify".
Side note, we have several patterns with trailing wildcards that will be ignored by Google bot.
@alanorth As far as I can see, as long as the wildcard isn't trailing, it shouldn't be ignored.
The change in this ticket came up in an email dialogue with a representative from Google Scholar.
One site where we have it in prod: https://repository.upenn.edu/robots.txt
Wouldn't it be useful to (additionally) add the rel="nofollow" attribute to the anchor tags in the search filters? This way we don't have to rely on how wildcards are handled by crawlers.
@hutattedonmyarm if we use rel="nofollow" on search pages it would be a sign for bots to not crawl them, but they still have to load the page to read the anchor tags. In theory the robots.txt method should be better because bots can read it beforehand.
@alanorth Not the whole page, I was only talking about the links in search-filters.component. So the checkboxes which check/uncheck all the filters in the search results sidebar. These are implemented as links. Currently, crawlers follow them because they're part of an entities page, but they only lead to search results.
@hutattedonmyarm oh yes, I was confusing the rel=nofollow with other robot instructions in head meta tags. I think you are right that we should make those links rel=nofollow.
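As a rough illustration of the rel=nofollow idea, the sketch below merges nofollow into whatever rel value a link already has, without duplicating it. The function name addNoFollow is hypothetical and is not taken from the dspace-angular code base; in an Angular template the result could presumably be bound on the facet anchor via [attr.rel], but where exactly that binding belongs is an assumption.

```typescript
// Hypothetical helper, not from dspace-angular: returns a rel attribute value
// that contains "nofollow" while keeping any tokens already present.
function addNoFollow(rel: string | null | undefined): string {
  // rel is a space-separated token list; drop empty tokens from extra spaces.
  const tokens = (rel ?? '').split(/\s+/).filter((t) => t.length > 0);
  // Link type keywords are case-insensitive, so compare in lower case.
  if (!tokens.some((t) => t.toLowerCase() === 'nofollow')) {
    tokens.push('nofollow');
  }
  return tokens.join(' ');
}

console.log(addNoFollow(null));
console.log(addNoFollow('noopener'));
```

Note that nofollow is only a hint to crawlers, so this would complement the robots.txt rule rather than replace it.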